Project "Species determination by geographic distribution"
Team members:
- Luis Arrieta Arrieta
- Stefany Solano González
Description
This project aims to highlight how poorly biodiversity is systematized and how, although Costa Rica holds approximately 8% of the world's natural wealth, that share is not visible in international databases. We also explore the existing records in the Catalogue of Life database, with a brief emphasis on the taxonomic group Fungi. As justification, the Global Biodiversity Information Facility (GBIF) serves as an international network and data infrastructure funded by the world's governments to give anyone, anywhere, open access to data about all forms of life on Earth; however, each country's share of the data deposited there also reflects other traits, such as its scientific participation in these networks. We therefore want to explore how the data are distributed and determine whether contributions to GBIF show a high representation of biodiverse countries such as Costa Rica, or whether they are dominated by other factors, potentially related to variables such as funding, purchasing power, share of GDP invested in science, and scientific development.
Background
Knowledge of the planet's biodiversity is essential for its use and protection. Understanding the niche, biology, and potential of taxonomic groups has allowed society to develop products of great impact and utility from them, with antibiotic, anti-inflammatory, biosynthetic, and antihistamine applications, among many others (Pacyga et al. 2024). Nevertheless, some taxonomic groups, such as fungi (Blis & Gloer 2016), and some study environments, such as those of marine species (Rogers et al. 2022), remain largely unknown. In addition, biodiversity records were originally kept by hand, and few people had access to them (Folk & Siniscalchi 2021) because they were concentrated in museums in developed countries. Scientific progress in its many dimensions has since provided mass access to information and data generation; however, systematizing that information remains complex (Kirk 2023, Alexander et al. 2024) and difficult to integrate. This complexity is compounded by the limited participation or inclusion of Latin American countries with high biodiversity indices, which makes it harder to make visible the natural value that resides in them and, consequently, complicates the implementation of protection and mitigation policies, among others.
Problem description and objective
The Catalogue of Life unifies all species known to date (last updated 26 March 2024). Given Costa Rica's international relevance as home to 8% of the world's biodiversity, we want to show the Costa Rican share of participation in this catalogue. In addition, fungi are one of the least known, explored, and categorized taxonomic groups, so we also focus on this group to check whether that knowledge gap is real. Our objective is therefore to explore the distribution of organisms by geographic region or country and to determine Costa Rican and Latin American participation in these records, as well as to show the current state of knowledge of specific taxonomic groups such as fungi.
Installing and importing libraries
# Install libraries
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install scikit-learn
!pip install matplotlib
!pip install ydata-profiling
# Library for diversity indices
!pip install scikit-bio
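scikit-bio is installed here for diversity indices. As a rough illustration of what such an index measures, the following is a minimal sketch of the Shannon diversity index written with NumPy only, so it does not depend on any particular scikit-bio version. The function name `shannon_index` and the example counts are hypothetical and are not part of the original analysis.

```python
import numpy as np

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over the non-zero counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]              # groups with zero records contribute nothing
    proportions = counts / counts.sum()      # relative abundance of each group
    return float(-(proportions * np.log(proportions)).sum())

# Hypothetical record counts for four taxonomic groups in one region
example_counts = [10, 10, 10, 10]
print(shannon_index(example_counts))
```

With perfectly even counts the index equals the natural logarithm of the number of groups; a single dominant group pushes it toward zero.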
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
%matplotlib inline
Import the dataset
Obtained from: https://www.gbif.org/es/dataset/7ddf754f-d193-4cc9-b351-99906754a03b
This dataset contains four tables that compile the catalogue of all species known on Earth to date. The catalogue includes both extinct and extant species and is believed to cover at least 80% of known species.
The data tables are:
- Distribution
- Reported species
- Reported taxa
- Vernacular names (local to each region) for the listed species
Note that all the information is in English.
Data citation: Bánki, O., Roskov, Y., Döring, M., Ower, G., Hernández Robles, D. R., Plata Corredor, C. A., Stjernegaard Jeppesen, T., Örn, A., Vandepitte, L., Hobern, D., Schalk, P., DeWalt, R. E., Ma, K., Miller, J., Orrell, T., Aalbu, R., Abbott, J., Adlard, R., Aedo, C., et al. (2024). Catalogue of Life Checklist (Version 2024-03-26). Catalogue of Life. https://doi.org/10.48580/dfz8d
# Load each dataframe with pandas
distribution = pd.read_csv('Distribution.tsv', sep='\t')
species = pd.read_csv('SpeciesProfile.tsv', sep='\t')
taxon = pd.read_csv('Taxon.tsv', sep='\t')
# Make the column headers easier to read by removing the 'dwc:' prefix
distribution = distribution.rename(columns=lambda x: x.replace('dwc:', ''))
species = species.rename(columns=lambda x: x.replace('dwc:', ''))
taxon = taxon.rename(columns=lambda x: x.replace('dwc:', ''))
<ipython-input-4-f448e452311d>:4: DtypeWarning: Columns (16) have mixed types. Specify dtype option on import or set low_memory=False.
taxon = pd.read_csv('Taxon.tsv', sep='\t')
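The DtypeWarning above itself suggests specifying `dtype` on import or setting `low_memory=False`. The following is a minimal sketch of the first option, assuming it is acceptable to read every column as text and convert selected columns later. The in-memory TSV is a hypothetical miniature with made-up values, not the real Taxon.tsv.

```python
import io
import pandas as pd

# Hypothetical miniature of a Darwin Core TSV; the real Taxon.tsv has 22 columns
sample_tsv = (
    "dwc:taxonID\tdwc:datasetID\tdwc:taxonRank\n"
    "9FSLC\t55434\tspecies\n"
    "8XRM\tabc-1050\tspecies\n"
)

# Reading every column as text avoids pandas guessing a type chunk by chunk
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=str)

# Same header clean-up as in the notebook: strip the 'dwc:' prefix
sample = sample.rename(columns=lambda name: name.replace("dwc:", ""))

print(sample.columns.tolist())
print(sample["datasetID"].tolist())
```

Reading as text keeps identifiers such as `datasetID` intact; numeric columns can still be converted afterwards with `pd.to_numeric` where that is needed.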
# Explore the size of each dataframe with shape
print("distribution.shape:", distribution.shape)
print("species.shape:", species.shape)
print("taxon.shape:", taxon.shape)
distribution.shape: (104015, 6)
species.shape: (473602, 5)
taxon.shape: (31349, 22)
Exploratory analysis
distribution  # View the dataframe
| | taxonID | occurrenceStatus | locationID | locality | countryCode | dcterms:source |
|---|---|---|---|---|---|---|
| 0 | 6L823 | native | NaN | Ecuador; Peru | NaN | NaN |
| 1 | T5NN | native | NaN | Panama | NaN | NaN |
| 2 | 7FVWC | native | mrgid:1912 | NaN | NaN | NaN |
| 3 | 7FVWC | native | mrgid:8402 | NaN | NaN | Ax, P., & Sopott-Ehlers, B. (1987). Otoplanida... |
| 4 | 3WT95 | native | tdwg:SUM | NaN | NaN | Group, S.F. (2023) SF specimen locality data f... |
| ... | ... | ... | ... | ... | ... | ... |
| 104010 | 6L7TZ | native | tdwg:PAN | NaN | NaN | NaN |
| 104011 | 6L7TZ | native | tdwg:PER | NaN | NaN | NaN |
| 104012 | 6L7TZ | native | tdwg:VEN | NaN | NaN | NaN |
| 104013 | 6BNZY | native | NaN | Congo | NaN | NaN |
| 104014 | 9M79Q | native | tdwg:ABT-OO | NaN | NaN | Newton, A.F. (2021) StaphBase: Staphyliniformi... |
104015 rows × 6 columns
distribution['locality'].unique()  # View the values of the variable
array(['Ecuador; Peru', 'Panama', nan, ...,
'Europe (AU BE BH CZ DE FI FR GB GE GR HU IT NL NR PL RO SK SL SP SV SZ UK), Russia (n+s European)',
'NE USA; USA: Indiana; USA: New York; USA: West Virginia',
'China (Guizhou, Zhejiang)'], dtype=object)
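Several `locality` values list more than one country in a single string (for example 'Ecuador; Peru'), so counting records per country requires splitting them first. The following is a minimal sketch on a small sample: the last two rows are made up for illustration, and it only handles the ';' separator, not forms such as 'China (Guizhou, Zhejiang)'.

```python
import pandas as pd

# Small sample; the rows with taxonID 'X1' and 'X2' are hypothetical
sample = pd.DataFrame({
    "taxonID": ["6L823", "T5NN", "7FVWC", "X1", "X2"],
    "locality": ["Ecuador; Peru", "Panama", None, "Costa Rica; Panama", "Costa Rica"],
})

# Split on ';' and place each country on its own row
places = (
    sample.dropna(subset=["locality"])
          .assign(place=lambda d: d["locality"].str.split(";"))
          .explode("place", ignore_index=True)
)
places["place"] = places["place"].str.strip()   # remove spaces left by the split

records_per_place = places["place"].value_counts()
costa_rica_share = records_per_place.get("Costa Rica", 0) / records_per_place.sum()

print(records_per_place)
print(f"Costa Rica share: {costa_rica_share:.2%}")
```

The same pattern could be applied to `distribution['locality']`, although the free-text entries would probably need further normalization before the counts are reliable.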
species  # View the dataframe
| | taxonID | gbif:isExtinct | gbif:isMarine | gbif:isFreshwater | gbif:isTerrestrial |
|---|---|---|---|---|---|
| 0 | 8XRM | False | False | False | True |
| 1 | 6D3DF | NaN | True | False | False |
| 2 | 49V84 | NaN | True | False | False |
| 3 | BYZP2 | True | NaN | NaN | NaN |
| 4 | 3JK6Z | False | False | False | True |
| ... | ... | ... | ... | ... | ... |
| 473597 | 9QLSL | false | NaN | NaN | NaN |
| 473598 | BJBQ4 | false | NaN | NaN | NaN |
| 473599 | 6YNKM | false | NaN | NaN | NaN |
| 473600 | 6KDRH | false | False | False | True |
| 473601 | 35CZM | fals | NaN | NaN | NaN |
473602 rows × 5 columns
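The `gbif:isExtinct` column above mixes `False`, `false`, missing values, and what looks like a truncated `fals`. The following is a minimal sketch of one way to normalize such a column into a nullable boolean, assuming that unrecognized strings should be treated as missing rather than guessed. The helper `to_nullable_bool` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample reproducing the kinds of values visible in gbif:isExtinct
raw = pd.Series([False, None, True, "false", "fals", "True"], name="gbif:isExtinct")

BOOLEAN_LABELS = {"true": True, "false": False}

def to_nullable_bool(value):
    """Map booleans and 'true'/'false' strings (any case) to bool; anything else to <NA>."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return BOOLEAN_LABELS.get(value.strip().lower(), pd.NA)
    return pd.NA

clean = raw.map(to_nullable_bool).astype("boolean")
print(clean.tolist())
```

Keeping the truncated value as missing avoids inventing data; counting how many values fall into that case would show whether the issue is widespread.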
taxon
| | taxonID | parentNameUsageID | acceptedNameUsageID | originalNameUsageID | scientificNameID | datasetID | taxonomicStatus | taxonRank | scientificName | scientificNameAuthorship | ... | infragenericEpithet | specificEpithet | infraspecificEpithet | cultivarEpithet | nameAccordingTo | namePublishedIn | nomenclaturalCode | nomenclaturalStatus | taxonRemarks | dcterms:references |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9FSLC | 92BPW | NaN | NaN | ---3nn39ZQdkDGBvoaGdR2 | 55434.0 | accepted | species | Homaloxestis australis Park, 2004 | Park, 2004 | ... | NaN | australis | NaN | NaN | NaN | Park, K.-T. (2004) Genus Homaloxestis Meyrick ... | ICZN | nomen legitimum | NaN | NaN |
| 1 | 8XRM | 8NLB3 | NaN | NaN | ---6f-YWvlv8BS-R6m-8Y | 1050.0 | accepted | species | Acanthograeffea modesta Günther, 1932 | Günther, 1932 | ... | NaN | modesta | NaN | NaN | NaN | Günther, K. (1932) Beiträge zur Systematik und... | ICZN | nomen legitimum | NaN | NaN |
| 2 | 6D3DF | 7NWBC | NaN | NaN | ---9Qo8j1JQR04niBsWYb0 | 1191.0 | accepted | species | Diarthrodes gravellicola Soyer, 1975 | Soyer, 1975 | ... | NaN | gravellicola | NaN | NaN | NaN | Soyer, J. (1975). Contribution a l’étude des C... | ICZN | nomen validum | NaN | https://www.marinespecies.org/copepoda/aphia.p... |
| 3 | 47BF5 | 63SP | NaN | NaN | ---BEZLG8WfmCKzOoARWg1 | 1141.0 | accepted | species | Neurotheca congolana De Wild. & T. Durand | De Wild. & T. Durand | ... | NaN | congolana | NaN | NaN | NaN | De Wild. & T. Durand. (1899). In: Compt. Rend.... | ICN | NaN | NaN | http://www.worldplants.de/?deeplink=Neurotheca... |
| 4 | 5BVQY | 87PB | NaN | NaN | ---D7syAmBAb7tLYVFT3L2 | 1141.0 | accepted | species | Weberbauerocereus cephalomacrostibas (Werderm.... | (Werderm. & Backeb.) F. Ritter | ... | NaN | cephalomacrostibas | NaN | NaN | NaN | Ritter, F. (1981). In: Kakteen Südamer. 4: 1353. | ICN | NaN | NaN | http://www.worldplants.de/?deeplink=Weberbauer... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31344 | BTY38 | BTWH2 | NaN | NaN | -W4y6bo2Lksp9tq-VpszI | 2299.0 | accepted | suborder | Orthotetidina Waagen, 1884 | Waagen, 1884 | ... | NaN | NaN | NaN | NaN | NaN | Waagen, W. H. (1884). Productus Limestone Foss... | ICZN | nomen validum | NaN | https://www.marinespecies.org/aphia.php?p=taxd... |
| 31345 | 7CZ6W | 83V5 | NaN | NaN | -W519nFjr9suClYRetk6q2 | 1141.0 | accepted | species | Turnera dasytricha Pilg. | Pilg. | ... | NaN | dasytricha | NaN | NaN | NaN | Pilg. (1902). In: Bot. Jahrb. Syst. 30: 176. | ICN | NaN | NaN | http://www.worldplants.de/?deeplink=Turnera-da... |
| 31346 | 6W2JF | 6SBV | NaN | NaN | -W56CNeiw5dk2Su0z29DX2 | 1027.0 | accepted | species | Plectris luctuosa Frey, 1967 | Frey, 1967 | ... | NaN | luctuosa | NaN | NaN | NaN | Frey, G. (1967). Die Gattung Plectris (Philoch... | ICZN | NaN | NaN | NaN |
| 31347 | 9YKMW | NaN | 3R6S7 | NaN | -W5K30rzgvcP9lNsPt33i0 | 1011.0 | synonym | species | Seliza bisecta Kirby, 1891 | Kirby, 1891 | ... | NaN | bisecta | NaN | NaN | NaN | NaN | ICZN | NaN | NaN | NaN |
| 31348 | 4HQWW | 9CLPF | NaN | 0Xhs7bfwG98S5S8G63Jxw | -W5LTi_j2Ftx5OuOeipNL | 2304.0 | accepted | specie | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31349 rows × 22 columns
# Get the taxonIDs shared by all three dataframes
common_taxonIDs = set(distribution['taxonID']).intersection(species['taxonID']).intersection(taxon['taxonID'])
print(common_taxonIDs)
{'6RBK4', 'D5ZP', '7CTDT', '7RZW5', '72P8R', 'BP7FL', '86986', '8GWQF', '8G3C5', 'GVSX', '676MZ', '6WHSF', '3DVL6', '3KVG8', '6P75C', '524CV', '9GFZD', '9BWKL', '4KQQW', '532JC', '86JNS', '5L38J', '86S4B', '9BCPC', '5FPVK', '552PQ', '7ZLYJ', '46LW8', '9F568', '8QJLB', 'LJ2C', '7BFPS', '6TSHS', '6HWHQ', '8P8WT', 'BN3LN', '4HYYF', '5KW94', '9YQM', 'PWD5', '4B8GS', '86FMH', '85YY4', '555YN', '7XMSR', '699HV', '4MLC6', '6RHNT', '47FD6', '7SX8S', '4QTDQ', '64SQ4', '7JZPT', 'JR22', '39YWT', '9J5CP', '559WJ', '8P9YT', '3H55S', '4BQMW', '4V63V', '4GD5Q', '3TQMR', '7DF9C', '7ZPML', '74KSW', '4PCM4', '64ZXD', '4WXCC', 'JWWP', '3QPR3', '6TWPK', 'D58M', '3GD7V', 'C4WM4', '8TDY9', '4MVL7', '3CGQK', '64VN3', '33HGB', '5X5GN', '3P6VT', '6KWGY', '5TX57', 'WK2C', '4KS6F', '4WWMQ', '7ZBLC', '4DJ6G', 'H66D', '79HGH', '3S468', '894CG', 'RVYX', 'B43VH', '3HDR7', '854PQ', '56V4T', 'CQ5G', 'B75BM', '7F8TJ', '46TDG', '93KL8', '6MMCC', '4VZMP', 'B7623', 'NC39', '3LGQD', '6BGKN', '67YTD', '6M566', '3L7YN', '69KXK', '7TDW6', '6SGW7', '76XPY', '7QQNJ'}
# Join the three dataframes on the taxonID column, keeping rows that do not match
df = distribution.merge(species, on='taxonID', how='outer').merge(taxon, on='taxonID', how='outer')
df
| | taxonID | occurrenceStatus | locationID | locality | countryCode | dcterms:source | gbif:isExtinct | gbif:isMarine | gbif:isFreshwater | gbif:isTerrestrial | ... | infragenericEpithet | specificEpithet | infraspecificEpithet | cultivarEpithet | nameAccordingTo | namePublishedIn | nomenclaturalCode | nomenclaturalStatus | taxonRemarks | dcterms:references |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6L823 | native | NaN | Ecuador; Peru | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | T5NN | native | NaN | Panama | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 7FVWC | native | mrgid:1912 | NaN | NaN | NaN | NaN | True | False | False | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 7FVWC | native | mrgid:8402 | NaN | NaN | Ax, P., & Sopott-Ehlers, B. (1987). Otoplanida... | NaN | True | False | False | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3WT95 | native | tdwg:SUM | NaN | NaN | Group, S.F. (2023) SF specimen locality data f... | False | False | False | True | ... | NaN | luctuosa | NaN | NaN | NaN | Brunner von Wattenwyl, C. (1888) Monographie d... | ICZN | nomen legitimum | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3115289 | 7KXQ3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | longa | parviflora | NaN | NaN | Maire, & Weiller. (1961). In: Fl. Afrique N. 7... | ICN | NaN | NaN | NaN |
| 3115290 | 76H4K | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | sphaerica | NaN | NaN | NaN | H. Schaef., S. S. Renner. (2011). In: Taxon 60... | ICN | NaN | NaN | http://www.worldplants.de/?deeplink=Penelopeia... |
| 3115291 | 76MKV | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | nigella | NaN | NaN | NaN | Cuatrec. (1981). In: Phytologia 49(3): 248. | ICN | NaN | NaN | NaN |
| 3115292 | 3NNXQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | acuta | NaN | NaN | NaN | NaN | ICZN | nomen legitimum | NaN | NaN |
| 3115293 | VTHP | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3115294 rows × 31 columns
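The merged dataframe has more rows than the three input tables combined, which can only happen in an outer merge when the same `taxonID` is repeated on both sides of a join. The following is a minimal sketch of that effect on hypothetical toy tables, using the `indicator` option of `merge` to see where each row comes from.

```python
import pandas as pd

# Hypothetical toy tables: taxonID 'A' appears three times on each side
left = pd.DataFrame({
    "taxonID": ["A", "A", "A", "B"],
    "locality": ["Peru", "Panama", "Ecuador", "Congo"],
})
right = pd.DataFrame({
    "taxonID": ["A", "A", "A", "C"],
    "isMarine": [True, True, True, False],
})

merged = left.merge(right, on="taxonID", how="outer", indicator=True)

# 'A' yields 3 x 3 = 9 rows; 'B' and 'C' add one unmatched row each
print(len(left), len(right), len(merged))
print(merged["_merge"].value_counts())

# Counting repeated keys before merging shows where the inflation comes from
print(left["taxonID"].duplicated().sum(), right["taxonID"].duplicated().sum())
```

Checking `duplicated()` on `taxonID` in `distribution`, `species`, and `taxon` before merging would indicate whether de-duplicating or aggregating one table first is appropriate for this analysis.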
# Explore the columns of each dataframe individually
print("distribution.columns:", distribution.columns)
print("species.columns:", species.columns)
print("taxon.columns:", taxon.columns)
distribution.columns: Index(['taxonID', 'occurrenceStatus', 'locationID', 'locality', 'countryCode',
'dcterms:source'],
dtype='object')
species.columns: Index(['taxonID', 'gbif:isExtinct', 'gbif:isMarine', 'gbif:isFreshwater',
'gbif:isTerrestrial'],
dtype='object')
taxon.columns: Index(['taxonID', 'parentNameUsageID', 'acceptedNameUsageID',
'originalNameUsageID', 'scientificNameID', 'datasetID',
'taxonomicStatus', 'taxonRank', 'scientificName',
'scientificNameAuthorship', 'col:notho', 'genericName',
'infragenericEpithet', 'specificEpithet', 'infraspecificEpithet',
'cultivarEpithet', 'nameAccordingTo', 'namePublishedIn',
'nomenclaturalCode', 'nomenclaturalStatus', 'taxonRemarks',
'dcterms:references'],
dtype='object')
# Generate the profiling report with ydata-profiling for the merged dataframe df
df_profile = ProfileReport(df, title="Profiling Report - Merged Dataset", explorative=True)
df_profile.to_file("species_report.html")
/usr/local/lib/python3.10/dist-packages/ydata_profiling/profile_report.py:363: UserWarning: Try running command: 'pip install --upgrade Pillow' to avoid ValueError warnings.warn(
df_profile