Project "Species determination by geographic distribution"

Team members:

  • Luis Arrieta Arrieta
  • Stefany Solano González

Description

This project aims to show how poorly biodiversity data are systematized: although Costa Rica holds approximately 8% of the world's natural wealth, this is not visible in international databases. We also explore the records in the Catalogue of Life database, with a brief focus on the taxonomic group Fungi. As justification, the Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world's governments to give anyone, anywhere, open access to data about all forms of life on Earth; however, each country's share of the data deposited there also reflects other traits, such as its scientific participation in these networks. We therefore want to explore how the data are distributed and see whether the deposits in GBIF show a high representation of highly biodiverse countries such as Costa Rica, or whether they are dominated by some other factor, potentially related to variables such as funding, purchasing power, share of GDP invested in science, or scientific development.

Background

Knowledge of the planet's biodiversity is essential for its use and protection. Understanding the niche, biology and potential of taxonomic groups has allowed society to derive from them products of great impact and usefulness, with antibiotic, anti-inflammatory, biosynthetic and antihistamine applications, among many others (Pacyga et al. 2024). However, there are taxonomic groups, such as fungi (Blis & Gloer 2016), and study environments, such as marine habitats (Rogers et al. 2022), about which little is known. In addition, biodiversity records were originally kept by hand and few people had access to them (Folk & Siniscalchi 2021), since they were concentrated in museums in developed countries. Scientific progress in its many dimensions has since provided massive access to information and data generation, but systematizing this information remains complex (Kirk 2023, Alexander et al. 2024) and the data are difficult to integrate. On top of this complexity, Latin American countries with high biodiversity indices participate or are included very little, which makes it difficult to make visible the natural value they hold and, consequently, complicates the implementation of protection and mitigation policies, among others.

Problem description and objective

The Catalogue of Life unifies all species known to date (last updated 26 March 2024). Given Costa Rica's international relevance as home to 8% of the world's biodiversity, we want to show the Costa Rican share of participation in this catalogue. In addition, fungi are one of the least known, explored and categorized taxonomic groups, so we also focus on this group to check whether this lack of knowledge is real. Our objective is therefore to explore the distribution of organisms by geographic region/country and to determine Costa Rican and Latin American participation in these records, as well as to show the current state of knowledge for specific taxonomic groups such as fungi.

Library installation and import

In [ ]:
# Install libraries
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install scikit-learn
!pip install matplotlib
!pip install ydata-profiling
In [ ]:
# Library for diversity indices
!pip install scikit-bio
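As a rough sketch of the kind of diversity index this library is installed for, the Shannon index H = -Σ p_i ln(p_i) can be computed directly with NumPy from per-taxon record counts. The counts below are invented for illustration and are not taken from the dataset, and the helper shannon_index is our own name; scikit-bio ships its own diversity measures, which are not used here.

```python
import numpy as np

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over categories with non-zero counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]          # zero counts contribute nothing
    proportions = counts / counts.sum()
    return float(-(proportions * np.log(proportions)).sum())

# Hypothetical record counts per taxon for two regions (illustrative only)
even_region = [25, 25, 25, 25]
skewed_region = [97, 1, 1, 1]

h_even = shannon_index(even_region)
h_skewed = shannon_index(skewed_region)
print(h_even, h_skewed)
```

With the same number of taxa, the evenly distributed region scores higher than the one dominated by a single taxon, which is the behaviour a diversity index is expected to capture.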
In [ ]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from ydata_profiling import ProfileReport

import matplotlib.pyplot as plt
%matplotlib inline

Importing the dataset

Obtained from: https://www.gbif.org/es/dataset/7ddf754f-d193-4cc9-b351-99906754a03b

This dataset contains 4 dataframes that compile the catalogue of all species known on Earth to date. The catalogue includes both extinct and extant species and is believed to cover at least 80% of known species.

The data tables are:

  • Distribution
  • Reported species
  • Reported taxa
  • Vernacular names (local to each region) for the listed species

Note that all the information is in English.

Data citation: Bánki, O., Roskov, Y., Döring, M., Ower, G., Hernández Robles, D. R., Plata Corredor, C. A., Stjernegaard Jeppesen, T., Örn, A., Vandepitte, L., Hobern, D., Schalk, P., DeWalt, R. E., Ma, K., Miller, J., Orrell, T., Aalbu, R., Abbott, J., Adlard, R., Aedo, C., et al. (2024). Catalogue of Life Checklist (Version 2024-03-26). Catalogue of Life. https://doi.org/10.48580/dfz8d

In [ ]:
# Load each dataframe with pandas
distribution = pd.read_csv('Distribution.tsv', sep='\t')
species = pd.read_csv('SpeciesProfile.tsv', sep='\t')
taxon = pd.read_csv('Taxon.tsv', sep='\t')

# Rename the column headers to make them easier to read [the 'dwc:' prefix is removed]
distribution = distribution.rename(columns=lambda x: x.replace('dwc:',''))
species = species.rename(columns=lambda x: x.replace('dwc:',''))
taxon = taxon.rename(columns=lambda x: x.replace('dwc:',''))
<ipython-input-4-f448e452311d>:4: DtypeWarning: Columns (16) have mixed types. Specify dtype option on import or set low_memory=False.
  taxon = pd.read_csv('Taxon.tsv', sep='\t')
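The DtypeWarning above appears because pandas found more than one value type in a column while reading the file. A minimal sketch of one way to avoid it, assuming it is acceptable to read every column as text first and convert selected columns explicitly afterwards. The small in-memory TSV, its values and the derived column datasetID_num are invented for illustration and do not reproduce Taxon.tsv.

```python
import io
import pandas as pd

# Tiny stand-in for a TSV file (hypothetical content): one column mixes numbers and text
raw_tsv = "dwc:taxonID\tdwc:datasetID\n9FSLC\t55434\n8XRM\t1050\n6D3DF\tunknown\n"

# dtype=str reads every column as text, so pandas does not have to guess column types
taxon_demo = pd.read_csv(io.StringIO(raw_tsv), sep="\t", dtype=str)
taxon_demo = taxon_demo.rename(columns=lambda c: c.replace("dwc:", ""))

# Convert explicitly where numbers are expected; unparseable values become NaN
taxon_demo["datasetID_num"] = pd.to_numeric(taxon_demo["datasetID"], errors="coerce")
print(taxon_demo.dtypes)
```

Passing low_memory=False, as the warning itself suggests, is the other common option; it lets pandas infer each column's type from the whole file at the cost of more memory.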
In [ ]:
# Explore the size of each dataframe with shape
print("distribution.shape:", distribution.shape)
print("species.shape:", species.shape)
print("taxon.shape:", taxon.shape)
distribution.shape: (104015, 6)
species.shape: (473602, 5)
taxon.shape: (31349, 22)

Exploratory analysis

In [ ]:
distribution  # View the dataframe
Out[ ]:
taxonID occurrenceStatus locationID locality countryCode dcterms:source
0 6L823 native NaN Ecuador; Peru NaN NaN
1 T5NN native NaN Panama NaN NaN
2 7FVWC native mrgid:1912 NaN NaN NaN
3 7FVWC native mrgid:8402 NaN NaN Ax, P., & Sopott-Ehlers, B. (1987). Otoplanida...
4 3WT95 native tdwg:SUM NaN NaN Group, S.F. (2023) SF specimen locality data f...
... ... ... ... ... ... ...
104010 6L7TZ native tdwg:PAN NaN NaN NaN
104011 6L7TZ native tdwg:PER NaN NaN NaN
104012 6L7TZ native tdwg:VEN NaN NaN NaN
104013 6BNZY native NaN Congo NaN NaN
104014 9M79Q native tdwg:ABT-OO NaN NaN Newton, A.F. (2021) StaphBase: Staphyliniformi...

104015 rows × 6 columns
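The locationID values in the preview above carry a prefix before the colon (for example tdwg: or mrgid:). A small sketch of separating that prefix from the code, so that records can be grouped by identifier scheme. The rows are copied from the preview; the column names location_scheme and location_code are our own and do not exist in the dataset.

```python
import numpy as np
import pandas as pd

# A few rows copied from the distribution preview
demo = pd.DataFrame({
    "taxonID": ["7FVWC", "3WT95", "6L7TZ", "6L7TZ", "6BNZY"],
    "locationID": ["mrgid:1912", "tdwg:SUM", "tdwg:PAN", "tdwg:PER", np.nan],
})

# Split "scheme:code" at the first colon; rows without a locationID stay missing
parts = demo["locationID"].str.split(":", n=1, expand=True)
demo["location_scheme"] = parts[0]
demo["location_code"] = parts[1]

# Number of records per identifier scheme (missing values are not counted)
scheme_counts = demo["location_scheme"].value_counts()
print(scheme_counts)
```

Once the code sits in its own column, filtering for a single area reduces to an equality test on location_code.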

In [ ]:
distribution['locality'].unique()  # View the values of the variable
Out[ ]:
array(['Ecuador; Peru', 'Panama', nan, ...,
       'Europe (AU BE BH CZ DE FI FR GB GE GR HU IT NL NR PL RO SK SL SP SV SZ UK), Russia (n+s European)',
       'NE USA; USA: Indiana; USA: New York; USA: West Virginia',
       'China (Guizhou, Zhejiang)'], dtype=object)
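The locality values above are free text, and some entries pack several places into one string. A minimal sketch of turning such strings into one row per place and counting them, assuming the semicolon is the separator. Entries that use other conventions, such as 'China (Guizhou, Zhejiang)', would need extra handling that is not attempted here. The record labels and the exact mix of values are invented.

```python
import numpy as np
import pandas as pd

# Locality strings modelled on the output above, keyed by invented record labels
localities = pd.Series(
    ["Ecuador; Peru", "Panama", np.nan, "Peru"],
    index=["rec1", "rec2", "rec3", "rec4"],
    name="locality",
)

# One row per place: split on ';', expand to rows, trim surrounding whitespace
places = (
    localities.dropna()
    .str.split(";")
    .explode()
    .str.strip()
)

place_counts = places.value_counts()
print(place_counts)
```

The index of places keeps the original record label, so each place can still be traced back to the record it came from.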
In [ ]:
species  # View the dataframe
Out[ ]:
taxonID gbif:isExtinct gbif:isMarine gbif:isFreshwater gbif:isTerrestrial
0 8XRM False False False True
1 6D3DF NaN True False False
2 49V84 NaN True False False
3 BYZP2 True NaN NaN NaN
4 3JK6Z False False False True
... ... ... ... ... ...
473597 9QLSL false NaN NaN NaN
473598 BJBQ4 false NaN NaN NaN
473599 6YNKM false NaN NaN NaN
473600 6KDRH false False False True
473601 35CZM fals NaN NaN NaN

473602 rows × 5 columns
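The preview above shows gbif:isExtinct holding both boolean values (False) and text values (false), alongside missing entries. A hedged sketch of normalising such a column to pandas' nullable boolean type follows. The sample values are modelled on the preview, the helper to_bool is our own, and unrecognised text is deliberately left as missing rather than guessed.

```python
import numpy as np
import pandas as pd

# Values modelled on the gbif:isExtinct preview: booleans, strings and NaN mixed
is_extinct = pd.Series([False, np.nan, True, "false", "fals"], name="gbif:isExtinct")

# Recognised spellings; anything else (including "fals" and NaN) maps to missing
mapping = {"true": True, "false": False}

def to_bool(value):
    """Return True/False for recognised spellings, None otherwise."""
    return mapping.get(str(value).strip().lower())

# "boolean" is the pandas nullable boolean dtype, which can hold missing values
normalised = is_extinct.map(to_bool).astype("boolean")
print(normalised)
```

After this step the flag can be counted or filtered consistently, and the number of entries that could not be interpreted is visible through isna().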

In [ ]:
taxon
Out[ ]:
taxonID parentNameUsageID acceptedNameUsageID originalNameUsageID scientificNameID datasetID taxonomicStatus taxonRank scientificName scientificNameAuthorship ... infragenericEpithet specificEpithet infraspecificEpithet cultivarEpithet nameAccordingTo namePublishedIn nomenclaturalCode nomenclaturalStatus taxonRemarks dcterms:references
0 9FSLC 92BPW NaN NaN ---3nn39ZQdkDGBvoaGdR2 55434.0 accepted species Homaloxestis australis Park, 2004 Park, 2004 ... NaN australis NaN NaN NaN Park, K.-T. (2004) Genus Homaloxestis Meyrick ... ICZN nomen legitimum NaN NaN
1 8XRM 8NLB3 NaN NaN ---6f-YWvlv8BS-R6m-8Y 1050.0 accepted species Acanthograeffea modesta Günther, 1932 Günther, 1932 ... NaN modesta NaN NaN NaN Günther, K. (1932) Beiträge zur Systematik und... ICZN nomen legitimum NaN NaN
2 6D3DF 7NWBC NaN NaN ---9Qo8j1JQR04niBsWYb0 1191.0 accepted species Diarthrodes gravellicola Soyer, 1975 Soyer, 1975 ... NaN gravellicola NaN NaN NaN Soyer, J. (1975). Contribution a l’étude des C... ICZN nomen validum NaN https://www.marinespecies.org/copepoda/aphia.p...
3 47BF5 63SP NaN NaN ---BEZLG8WfmCKzOoARWg1 1141.0 accepted species Neurotheca congolana De Wild. & T. Durand De Wild. & T. Durand ... NaN congolana NaN NaN NaN De Wild. & T. Durand. (1899). In: Compt. Rend.... ICN NaN NaN http://www.worldplants.de/?deeplink=Neurotheca...
4 5BVQY 87PB NaN NaN ---D7syAmBAb7tLYVFT3L2 1141.0 accepted species Weberbauerocereus cephalomacrostibas (Werderm.... (Werderm. & Backeb.) F. Ritter ... NaN cephalomacrostibas NaN NaN NaN Ritter, F. (1981). In: Kakteen Südamer. 4: 1353. ICN NaN NaN http://www.worldplants.de/?deeplink=Weberbauer...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31344 BTY38 BTWH2 NaN NaN -W4y6bo2Lksp9tq-VpszI 2299.0 accepted suborder Orthotetidina Waagen, 1884 Waagen, 1884 ... NaN NaN NaN NaN NaN Waagen, W. H. (1884). Productus Limestone Foss... ICZN nomen validum NaN https://www.marinespecies.org/aphia.php?p=taxd...
31345 7CZ6W 83V5 NaN NaN -W519nFjr9suClYRetk6q2 1141.0 accepted species Turnera dasytricha Pilg. Pilg. ... NaN dasytricha NaN NaN NaN Pilg. (1902). In: Bot. Jahrb. Syst. 30: 176. ICN NaN NaN http://www.worldplants.de/?deeplink=Turnera-da...
31346 6W2JF 6SBV NaN NaN -W56CNeiw5dk2Su0z29DX2 1027.0 accepted species Plectris luctuosa Frey, 1967 Frey, 1967 ... NaN luctuosa NaN NaN NaN Frey, G. (1967). Die Gattung Plectris (Philoch... ICZN NaN NaN NaN
31347 9YKMW NaN 3R6S7 NaN -W5K30rzgvcP9lNsPt33i0 1011.0 synonym species Seliza bisecta Kirby, 1891 Kirby, 1891 ... NaN bisecta NaN NaN NaN NaN ICZN NaN NaN NaN
31348 4HQWW 9CLPF NaN 0Xhs7bfwG98S5S8G63Jxw -W5LTi_j2Ftx5OuOeipNL 2304.0 accepted specie NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

31349 rows × 22 columns

In [ ]:
# Get the taxonIDs present in all three dataframes
common_taxonIDs = set(distribution['taxonID']).intersection(species['taxonID']).intersection(taxon['taxonID'])

print(common_taxonIDs)
{'6RBK4', 'D5ZP', '7CTDT', '7RZW5', '72P8R', 'BP7FL', '86986', '8GWQF', '8G3C5', 'GVSX', '676MZ', '6WHSF', '3DVL6', '3KVG8', '6P75C', '524CV', '9GFZD', '9BWKL', '4KQQW', '532JC', '86JNS', '5L38J', '86S4B', '9BCPC', '5FPVK', '552PQ', '7ZLYJ', '46LW8', '9F568', '8QJLB', 'LJ2C', '7BFPS', '6TSHS', '6HWHQ', '8P8WT', 'BN3LN', '4HYYF', '5KW94', '9YQM', 'PWD5', '4B8GS', '86FMH', '85YY4', '555YN', '7XMSR', '699HV', '4MLC6', '6RHNT', '47FD6', '7SX8S', '4QTDQ', '64SQ4', '7JZPT', 'JR22', '39YWT', '9J5CP', '559WJ', '8P9YT', '3H55S', '4BQMW', '4V63V', '4GD5Q', '3TQMR', '7DF9C', '7ZPML', '74KSW', '4PCM4', '64ZXD', '4WXCC', 'JWWP', '3QPR3', '6TWPK', 'D58M', '3GD7V', 'C4WM4', '8TDY9', '4MVL7', '3CGQK', '64VN3', '33HGB', '5X5GN', '3P6VT', '6KWGY', '5TX57', 'WK2C', '4KS6F', '4WWMQ', '7ZBLC', '4DJ6G', 'H66D', '79HGH', '3S468', '894CG', 'RVYX', 'B43VH', '3HDR7', '854PQ', '56V4T', 'CQ5G', 'B75BM', '7F8TJ', '46TDG', '93KL8', '6MMCC', '4VZMP', 'B7623', 'NC39', '3LGQD', '6BGKN', '67YTD', '6M566', '3L7YN', '69KXK', '7TDW6', '6SGW7', '76XPY', '7QQNJ'}
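The set printed above holds the taxonID values present in all three tables. As a sketch of how that overlap could be quantified, the toy tables below (invented IDs, only the taxonID column) compute the shared IDs and the fraction of each table's distinct IDs that they represent. The names id_sets, shared_ids and coverage are our own.

```python
import pandas as pd

# Toy stand-ins for the three tables; only the taxonID column matters here
distribution_demo = pd.DataFrame({"taxonID": ["A1", "A1", "B2", "C3"]})
species_demo = pd.DataFrame({"taxonID": ["A1", "B2", "D4"]})
taxon_demo = pd.DataFrame({"taxonID": ["A1", "D4", "E5"]})

id_sets = {
    "distribution": set(distribution_demo["taxonID"]),
    "species": set(species_demo["taxonID"]),
    "taxon": set(taxon_demo["taxonID"]),
}

# IDs present in every table, and the share of each table's distinct IDs they cover
shared_ids = set.intersection(*id_sets.values())
coverage = {name: len(shared_ids) / len(ids) for name, ids in id_sets.items()}

print(sorted(shared_ids))
print(coverage)
```

A low coverage value for a table signals that most of its records will have no counterpart in the other tables after a merge.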
In [ ]:
# Join the 3 dataframes on the taxonID column, keeping rows that do not match
df = distribution.merge(species, on='taxonID', how='outer').merge(taxon, on='taxonID', how='outer')
df
Out[ ]:
taxonID occurrenceStatus locationID locality countryCode dcterms:source gbif:isExtinct gbif:isMarine gbif:isFreshwater gbif:isTerrestrial ... infragenericEpithet specificEpithet infraspecificEpithet cultivarEpithet nameAccordingTo namePublishedIn nomenclaturalCode nomenclaturalStatus taxonRemarks dcterms:references
0 6L823 native NaN Ecuador; Peru NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 T5NN native NaN Panama NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 7FVWC native mrgid:1912 NaN NaN NaN NaN True False False ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 7FVWC native mrgid:8402 NaN NaN Ax, P., & Sopott-Ehlers, B. (1987). Otoplanida... NaN True False False ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3WT95 native tdwg:SUM NaN NaN Group, S.F. (2023) SF specimen locality data f... False False False True ... NaN luctuosa NaN NaN NaN Brunner von Wattenwyl, C. (1888) Monographie d... ICZN nomen legitimum NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3115289 7KXQ3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN longa parviflora NaN NaN Maire, & Weiller. (1961). In: Fl. Afrique N. 7... ICN NaN NaN NaN
3115290 76H4K NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN sphaerica NaN NaN NaN H. Schaef., S. S. Renner. (2011). In: Taxon 60... ICN NaN NaN http://www.worldplants.de/?deeplink=Penelopeia...
3115291 76MKV NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN nigella NaN NaN NaN Cuatrec. (1981). In: Phytologia 49(3): 248. ICN NaN NaN NaN
3115292 3NNXQ NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN acuta NaN NaN NaN NaN ICZN nomen legitimum NaN NaN
3115293 VTHP NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3115294 rows × 31 columns
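The merged table has 3,115,294 rows, more than the three input tables combined, which can only happen when a taxonID is repeated on both sides of a merge so that the matching rows multiply. A small sketch with invented toy tables shows the effect and two pandas options that help diagnose it: indicator=True, which labels where each row came from, and validate, which raises an error when a key is not unique on the side declared as "one".

```python
import pandas as pd

# Toy tables: taxonID "A1" appears twice on each side
left = pd.DataFrame({"taxonID": ["A1", "A1", "B2"], "locality": ["Peru", "Panama", "Congo"]})
right = pd.DataFrame({"taxonID": ["A1", "A1", "C3"], "isMarine": [True, False, True]})

# Outer merge: the 2 x 2 "A1" rows combine into 4 rows; _merge records each row's origin
merged = left.merge(right, on="taxonID", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# validate raises MergeError because taxonID is not unique in the left table
try:
    left.merge(right, on="taxonID", how="outer", validate="one_to_many")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print(len(merged), keys_unique)
```

Deduplicating or aggregating one table per taxonID before merging is one way to keep the row count under control, depending on what each row is meant to represent.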

In [ ]:
# Explore the columns of each dataframe individually
print("distribution.columns:", distribution.columns)
print("species.columns:", species.columns)
print("taxon.columns:", taxon.columns)
distribution.columns: Index(['taxonID', 'occurrenceStatus', 'locationID', 'locality', 'countryCode',
       'dcterms:source'],
      dtype='object')
species.columns: Index(['taxonID', 'gbif:isExtinct', 'gbif:isMarine', 'gbif:isFreshwater',
       'gbif:isTerrestrial'],
      dtype='object')
taxon.columns: Index(['taxonID', 'parentNameUsageID', 'acceptedNameUsageID',
       'originalNameUsageID', 'scientificNameID', 'datasetID',
       'taxonomicStatus', 'taxonRank', 'scientificName',
       'scientificNameAuthorship', 'col:notho', 'genericName',
       'infragenericEpithet', 'specificEpithet', 'infraspecificEpithet',
       'cultivarEpithet', 'nameAccordingTo', 'namePublishedIn',
       'nomenclaturalCode', 'nomenclaturalStatus', 'taxonRemarks',
       'dcterms:references'],
      dtype='object')
In [ ]:
# Generate the report with ydata-profiling for the merged dataframe df
df_profile = ProfileReport(df, title="Pandas Profiling Report - SpeciesProfile Dataset", explorative=True)
df_profile.to_file("species_report.html")
/usr/local/lib/python3.10/dist-packages/ydata_profiling/profile_report.py:363: UserWarning: Try running command: 'pip install --upgrade Pillow' to avoid ValueError
  warnings.warn(
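Profiling all 3,115,294 rows of the merged table can be slow and memory-hungry. One possible workaround, sketched here on a synthetic stand-in table, is to draw a reproducible random sample with DataFrame.sample and profile that instead. The sample fraction, the seed and the names df_demo and df_sample are arbitrary choices of ours. The ProfileReport call is left as a comment so the block runs without the library; minimal=True is the option ydata-profiling documents for large datasets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged dataframe (the real df has 3,115,294 rows)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "taxonID": [f"T{i}" for i in range(10_000)],
    "isMarine": rng.choice([True, False], size=10_000),
})

# Reproducible 10% random sample; fraction and seed are arbitrary choices
df_sample = df_demo.sample(frac=0.10, random_state=42)
print(df_sample.shape)

# The sample could then be profiled instead of the full table, for example:
# ProfileReport(df_sample, title="Sampled report", minimal=True).to_file("sample_report.html")
```

A sample keeps summary statistics approximately representative, but rare categories may be missing from it, so counts of uncommon values should still be checked on the full table.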
In [ ]:
df_profile