---
title: "Analysing Phytochemical Diversity -- an introduction to *chemodiv*"
author: "Hampus Petrén^1^, Tobias G. Köllner^2^, Robert R. Junker^1,3^"
date: "`r Sys.Date()`"
output: rmarkdown::html_document
vignette: >
  %\VignetteIndexEntry{Analysing Phytochemical Diversity -- An introduction to *chemodiv*}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

^1^ Department of Biology, Philipps-University Marburg, Marburg, Germany

^2^ Department of Natural Product Biosynthesis, Max Planck Institute for Chemical Ecology, Jena, Germany

^3^ Department of Environment and Biodiversity, University of Salzburg, Salzburg, Austria

## Introduction
`chemodiv` is an R package for analysing phytochemical data. It includes
a number of functions that makes it straightforward to quantify and
visualize phytochemical diversity and dissimilarity for any type of
phytochemical samples, such as herbivore defence compounds, volatiles 
or similar. Importantly, calculations of diversity and dissimilarity can 
incorporate biosynthetic and/or structural properties of the phytochemical 
compounds, resulting in more comprehensive quantifications of 
diversity and dissimilarity.

This introduction serves as a tutorial explaining the main intended use of 
all the functions in the package. A complete description of the package is
available in Petrén et al. 2023a, while a more in-depth discussion and review
of phytochemical diversity is available in Petrén et al. 2023b.


## Setup
The current version of the package can be installed from CRAN. Alternatively,
the developmental version of the package can be installed from GitHub
using the `install_github()` function from the `devtools` package.

```{r, eval = FALSE}
# Install current version
install.packages("chemodiv")

# Install developmental version
install.packages("devtools") # Install devtools if not already installed
library("devtools")
install_github("hpetren/chemodiv")
```

```{r}
# Load chemodiv
library(chemodiv)
```


## Using *chemodiv*

### Data formatting

We illustrate the use of the *chemodiv* package by using data,
included in the package, on floral scent of the plant *Arabis alpina*
(Petrén et al. 2021). We use a subset of the data consisting of 87 individuals 
from three populations.

Two separate datasets are needed to use the full set of analyses in 
the *chemodiv* package. 

The first dataset, which we call the *sample* dataset, should be a 
data frame containing data on the relative concentrations (proportions) of 
different compounds (columns) in different samples (rows):
```{r}
data("alpinaSampData")

head(alpinaSampData)[,1:5]
```
The dataset contains the relative concentration of 15 phytochemical 
compounds in different samples.

The second dataset, which we call the *compound* dataset, should be a 
data frame containing, in each of three columns, the common name, 
SMILES and InChIKey IDs of all the compounds that are present in the
*sample* dataset:
```{r}
data("alpinaCompData")

head(alpinaCompData)
```
SMILES and InChIKey are chemical identifiers that are easily obtained for 
each compound by searching for compounds in
[PubChem](https://pubchem.ncbi.nlm.nih.gov/). Searching for compounds
by their common name, or a number of other chemical identifiers, will
bring up the matching molecule, along with the SMILES and InChIKey.
Additionally, various automated tools such as the 
[PubChem Identifier Exchange Service](https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi) 
or [The Chemical Translation Service](https://cts.fiehnlab.ucdavis.edu/)
can be used to automatically obtain IDs for lists of compounds.
The user has to compile the SMILES and InChIKey manually to ensure correctness,
as lists of compounds very often contain compounds wrongly named, 
wrongly formatted, under various synonyms etc. which prevents efficient
automatic translation of compound names to SMILES and InChIKey.

As the *compound* dataset consists of a list of known compounds,
analyses in the *chemodiv* package will work best for sets of data,
commonly generated by chemical ecologists using GC-MS, LC-MS or similar, 
where all or most compounds in the samples have been confidently identified.

When both datasets are prepared, we can use the function `chemoDivCheck()` 
to make sure that the *sample* dataset and the *compound* dataset 
are formatted correctly, so that they can be used by the other functions
in the package:
```{r}
chemoDivCheck(compoundData = alpinaCompData, sampleData = alpinaSampData)
```

In addition to the *sample* and *compound* datasets, a third dataset
indicating what groups different samples belong to can be used in the 
plotting functions below. 

```{r}
data("alpinaPopData")

table(alpinaPopData)
```
In this case, our samples belong to three different populations:
G1 (Greece), It8 (Italy) and S1 (Sweden).

Now we have all necessary datasets, which are correctly formatted,
and we can begin with analyses. 


### Compound classification and dissimilarities
Two functions are used to classify and compare phytochemical compounds,
and are applied to the *compound* dataset.

The function `NPCTable()` classifies compounds with 
[NPClassifier](https://npclassifier.ucsd.edu/) (Kim et al. 2021). NPClassifier 
is a deep-learning tool that classifies phytochemical compounds into a 
hierarchical classification of three groups, pathway, superclass and class, 
largely corresponding to biosynthetic pathways:
```{r, eval = FALSE}
alpinaNPC <- NPCTable(compoundData = alpinaCompData)

alpinaNPC[1,] # Classification of the first compound in dataset
```

```{r, echo = FALSE}
data("alpinaNPCTable")
alpinaNPC <- alpinaNPCTable
rm(alpinaNPCTable)
alpinaNPC[1,]
```

Here, the first compound in the list, *(Z)-beta-Ocimene*, is classified as
Terpenoids > Monoterpenoids > Acyclic monoterpenoids.
The classification of the compounds is put into a data frame and can 
subsequently be used by other functions.

***

The function `compDis()` compares phytochemical compounds by calculating
pairwise Jaccard dissimilarities between them. The dissimilarity 
calculations can be based on the biosynthesis and/or structure 
of the compounds. `type = "NPClassifier"` calculates dissimilarities 
based on the classification made by NPClassifier, which largely corresponds
to biosynthetic pathways. `type = "PubChemFingerprint"` and
`type = "fMCS"` are two similar methods that calculate dissimilarities 
based on the structure of the compounds. In this case, molecules that
have similar substructures/features will have a low dissimilarity,
while molecules not having similar substructures/features will have
a high dissimilarity.

We calculate compound dissimilarities with `type = "PubChemFingerprint"`.
```{r, eval = FALSE}
alpinaCompDis <- compDis(compoundData = alpinaCompData,
                         type = "PubChemFingerprint")

alpinaCompDis$fingerDisMat[1:4, 1:4] # Part of compound dissimilarity matrix
```

```{r, echo = FALSE, message = FALSE}
data("alpinaCompDis")
alpinaCompDisMat <- alpinaCompDis
rm(alpinaCompDis)
alpinaCompDis <- list()
alpinaCompDis[["fingerDisMat"]] <- alpinaCompDisMat
alpinaCompDis$fingerDisMat[1:4, 1:4]
```

The output from the function is a list with one or several compound 
dissimilarity matrices, depending on which `type` was used as input.
If multiple `type` are used as input, a matrix with mean values of the other
matrices will also calculated. If `type` includes `"NPClassifier"`, a matrix
with "mixed" values is also calculated. In this case, values are based on
NPClassifier when these are *> 0*, and otherwise based the 
PubChem fingerprints/fMCS values (see manual for details).
Importantly, a resulting matrix of compound dissimilarities is used by
other functions in the package that quantify phytochemical diversity
and dissimilarity, and can be used to visualize how similar sets of
compounds are to each other.


### Diversity calculations
Calculations of phytochemical diversity is the core of the *chemodiv* package. 
Phytochemical diversity can be calculated for the *sample* dataset 
with functions `calcDiv()`, `calcBetaDiv()` and `calcDivProf()`.

`calcDiv()` calculates alpha diversity for each sample (row) in the 
*sample* dataset. The function can calculate a 
number of different diversity and evenness indices, depending on
what `type` is used as input. The default and recommended way of calculating 
diversity is as Hill numbers (Chao et al. 2014), which provides 
a number of advantages, including the use of the parameter *q* which 
controls the sensitivity of the measure to the relative concentrations
of compounds. This can be done as "normal" Hill diversity, which depends
on compound richness and evenness, and as functional Hill diversity, 
which additionally considers compound dissimilarity (Chiu & Chao 2014),
utilizing the compound dissimilarity matrix from `compDis()`. This means that,
for calculations of functional Hill diversity, a set of compounds that are
biosynthetically/structurally different from each other is more diverse
than a similar set of compounds that are biosynthetically/structurally 
more similar to each other.

We calculate functional Hill diversity for *q = 1*. 
```{r}
alpinaDiv <- calcDiv(sampleData = alpinaSampData, 
                     compDisMat = alpinaCompDis$fingerDisMat,
                     type = "FuncHillDiv",
                     q = 1)

head(alpinaDiv)
```
The function outputs a data frame with samples as rows and the selected 
`type` of measures as columns.

***

`calcDivProf()` can be used to calculate Hill diversity for a range of 
q-values simultaneously, generating a so called diversity profile. This 
allows for a more nuanced exploration of the diversity.

We calculate a diversity profile for functional Hill diversity, using the
default range of *q = 0* to *q = 3*. 
```{r}
alpinaDivProf <- calcDivProf(sampleData = alpinaSampData,
                             compDisMat = alpinaCompDis$fingerDisMat,
                             type = "FuncHillDiv")

head(alpinaDivProf$divProf)[,1:5] # Part of the diversity profile data frame
```
The function outputs a list with input parameters and the diversity profile 
as a data frame, with samples as rows and diversity values at different
values of *q* as columns.

***

`calcBetaDiv()` calculates beta diversity for a set of samples, in the
Hill numbers framework. The function calculates a single beta-diversity value 
for the supplied sample data. This is calculated as *beta = gamma / alpha*,
where gamma is the diversity of the pooled data set and alpha represents 
the mean diversity of individual samples.
```{r}
alpinaBetaDiv <- calcBetaDiv(sampleData = alpinaSampData,
                             compDisMat = alpinaCompDis$fingerDisMat,
                             type = "FuncHillDiv")

alpinaBetaDiv
```
The function outputs a data frame with type of beta diversity calculated, *q*,
and values for gamma diversity, mean alpha diversity and beta diversity.


### Sample dissimilarities
Calculations of pairwise phytochemical dissimilarities between samples can 
be made with the function `sampDis()`. This calculates Bray-Curtis 
dissimilarities and/or Generalized UniFrac dissimilarities 
(Chen et al. 2012, Junker 2018) between samples in the *sample* 
dataset. Generalized UniFrac dissimilarities utilize the 
compound dissimilarity matrix from `compDis()`, such that two samples 
containing more biosynthetically/structurally different compounds have a 
higher pairwise dissimilarity than two samples containing more
biosynthetically/structurally similar compounds.

We calculate phytochemical dissimilarity as Generalized UniFrac dissimilarities:
```{r, message = FALSE}
alpinaSampDis <- sampDis(sampleData = alpinaSampData,
                         compDisMat = alpinaCompDis$fingerDisMat,
                         type = "GenUniFrac")

alpinaSampDis$GenUniFrac[1:4, 1:4] # Part of sample dissimilarity matrix
```
The output from the function is a list with one or several sample 
dissimilarity matrices, depending on which `type` was used as input.


### Molecular network
Function `molNet()` uses a matrix generated by the `compDis()` function to 
create a molecular network of the phytochemical compounds, and calculates 
some properties of the network. Such networks are useful to
visualize relationships between compound similarities and abundances.

We create a molecular network based on the compound dissimilarity matrix.
We can also include the the NPClassifier classification, where the "pathway"
group will control node colour. We manually set `cutOff = 0.75` in this case.
This limits the number of edges between nodes, as edges are only plotted 
between nodes if their similarity (where *similarity = 1 - dissimilarity*) 
is larger than the cut-off value.
```{r, message = FALSE}
alpinaNetwork <- molNet(compDisMat = alpinaCompDis$fingerDisMat,
                        npcTable = alpinaNPC,
                        cutOff = 0.75)

summary(alpinaNetwork)
```
The output is a list including the network object, and some basic 
network parameters.


### Chemodiversity and network plots
Once we have calculated different measures of phytochemical diversity and 
dissimilarity using the above functions, two plotting functions can be used to 
conveniently create different types of plots and molecular networks.

`molNetPlot()` creates a basic plot of the molecular network generated
by the `molNet()` function. We create a single molecular network for the whole
dataset, including compound names and the classification by NPClassifier.
```{r, fig.width = 12, fig.height = 8, out.width = "95%"}
molNetPlot(sampleData = alpinaSampData,
           networkObject = alpinaNetwork$networkObject,
           npcTable = alpinaNPC,
           plotNames = TRUE)
```

Nodes represent individual compounds. They are coloured by their 
"pathway" classification by NPClassifier. Node size corresponds to the mean
proportional concentration of the compounds in the samples. Edge widths 
represent compound similarity, and are only plotted for similarities higher 
than the cutoff value. We see that the floral scent bouquet consists
mostly of compounds belonging to the "Shikimates and Phenylpropanoids" 
pathway. These compounds are connected to each other in a network, but 
separated from the three "Terpenoids" compounds, indicating that the two 
groups of compounds belonging to different pathways are also structurally 
different to each other.

***

`chemoDivPlot()` can be used to conveniently create basic plots of the 
different types of diversity and dissimilarity measurements calculated by 
the functions above. Four types of plots can be created, in any combination.
With argument `compDisMat` a dendrogram visualizing compound dissimilarities is 
created. With argument `divData`, diversity/evenness values are visualized 
with boxplots. With argument `divProfData` a diversity profile will be created. 
With argument `sampDisMat` sample dissimilarities will be visualized as an 
NMDS plot. Grouping data can be supplied with argument `groupData`.
```{r, fig.width = 12, fig.height = 8, out.width = "95%"}
chemoDivPlot(compDisMat = alpinaCompDis$fingerDisMat,
             divData = alpinaDiv,
             divProfData = alpinaDivProf,
             sampDisMat = alpinaSampDis$GenUniFrac,
             groupData = alpinaPopData)
```

The output consists of the selected plots. The dendrogram visualizes 
similarities between compounds in a way complementary to the molecular
network above, with the three terpenoids having a high dissimilarity to 
the other compounds. The boxplot indicates that the functional Hill diversity
of the floral scent compounds is highest in the G1 population. Further
analyses can examine more exactly what components of diversity are higher in 
this population. The diversity profile demonstrates how at *q = 1* 
(shown also in boxplot) diversity is highest for population G1, but at *q = 0*,
where compound proportions are not taken into account, diversity is 
highest for population It8. Finally, the NMDS indicates that all three
populations are compositionally/structurally different to each other, and 
that in population It8, dissimilarities between samples in the same 
population is lower than for the other two populations.

In this example, we end at the plots visualizing phytochemical diversity
and dissimilarity. Such plots may inform about patterns of variation within
and between groups of samples that represent different populations, species
etc. Importantly, testing for associations between measures of 
diversity/dissimilarity and variables such as herbivore performance, 
pollinator visitation rates and plant fitness may provide insights on the 
effects of phytochemical variation on plants for various ecological
interactions and evolutionary processes.


### Shortcut function
The function `quickChemoDiv()` uses many of the above functions to in one 
simple step calculate and visualize chemodiversity for users wanting to 
quickly explore their data using standard parameters. This can be used
to generate the same four types of plots as above.
```{r, eval = FALSE}
quickChemoDiv(compoundData = alpinaCompData,
              sampleData = alpinaSampData,
              groupData = alpinaPopData,
              outputType = "plots") # Not run
```


## Solutions to common questions and issues

* **Getting datasets in order.** See `?chemodiv` for a detailed description
on how to structure datasets, and compile chemical identifiers for compounds.
* **Data with unidentified compounds.** See `?chemodiv` for a description on
how datasets with missing data can be handled, and for alternative ways
to calculate compound dissimilarities when most or all compounds are
unidentified.
* **What method should I use to calculate compound dissimilarities?** See
`?compDis` for details on how compound dissimilarities are calculated using
the three different methods. See Petrén et al. 2023a for a discussion on
what method is suitable depending on type of data and research
question addressed.
* **Long computation times with `compDis()`.** The `compDis()` function can
calculate compound dissimilarities using `NPClassifier`, `PubChemFingerprint`
and `fMCS`. For larger datasets, this will take some time as data is
downloaded and pairwise compound dissimilarities are calculated. Of the three
methods, `fMCS` is much more computationally intensive than the
others, and may take a very long time for datasets with a large number
of structurally complex molecules. In such cases, it is recommended to
use `PubChemFingerprint` instead.
* **What kind of phytochemical diversity should I calculate?** See
`?calcDiv` for a detailed description on how different measures of
diversity and evenness are calculated. See Petrén et al. 2023a for a 
comparison of different diversity indices, and Petrén et al. 2023b for a 
more in-depth discussion and review of different components and measures of
phytochemical diversity in the context of their function, mechanism
and ecology.
* **Long computation times with `chemoDivPlot()`.** The `chemoDivPlot()`
function can be used to conveniently create basic plots of chemodiversity.
If the function argument `sampDisMat` is included, a Nonmetric Multidimensional 
Scaling (NMDS) will be performed. This might take some time for larger
datasets, and excluding this argument will make the plotting much quicker.
* **Can I customize plots created by `chemoDivPlot()` or `molNetPlot()`?**
These functions exist to provide an easy way to create basic chemodiversity 
plots and molecular networks, and therefore have limited customization
options. Customized plots are easily created with the `ggplot2` 
and/or `ggraph` packages.
* **Installing `chemodiv` gives a warning message.**
Installing the package using `install.packages("chemodiv")` may result in
the warning message `Warning in install.packages : dependencies ‘fmcsR’,
‘ChemmineR’ are not available`, meaning that these package dependencies 
from Bioconductor have not been installed. It is recommended to install the 
package using the `install()` function in  the `BiocManager` package instead.
See the installation instructions in the README file for details.
* **I get an error about missing packages when trying to run some functions.**
If you get the error message `Error in loadNamespace(x) : there is no 
package called ‘ChemmineR’` or `Error in loadNamespace(x) : there is no 
package called ‘fmcsR’`, it means these package dependencies from Bioconductor 
have not been installed. Either install these separately, or
reinstall `chemodiv` using the `install()` function in  the `BiocManager` 
package. See the installation instructions in the README file for details.


## References

Chao, A., C.-H. Chiu, and L. Jost. 2014. Unifying Species Diversity, 
Phylogenetic Diversity, Functional Diversity, and Related Similarity 
and Differentiation Measures Through Hill Numbers. Annu. Rev. Ecol. Evol. Syst. 
45:297–324.

Chen, J., K. Bittinger, E. S. Charlson, C. Hoffmann, J. Lewis, G. D. Wu, 
R. G. Collman, F. D. Bushman, and H. Li. 2012. Associating microbiome 
composition with environmental covariates using generalized UniFrac distances. 
Bioinformatics 28:2106–2113.

Chiu, C.-H., and A. Chao. 2014. Distance-Based Functional Diversity 
Measures and Their Decomposition: A Framework Based on Hill Numbers. 
PLoS ONE 9:e100014.

Junker, R. R. 2018. A biosynthetically informed distance measure to 
compare secondary metabolite profiles. Chemoecology 28:29–37.

Kim, H. W., M. Wang, C. A. Leber, L.-F. Nothias, R. Reher, K. B. Kang, 
J. J. J. van der Hooft, P. C. Dorrestein, W. H. Gerwick, and G. W. Cottrell. 
2021. NPClassifier: A Deep Neural Network-Based Structural Classification 
Tool for Natural Products. J. Nat. Prod. 84:2795–2807.

Petrén, H., P. Toräng, J. Ågren, and M. Friberg. 2021.
Evolution of floral scent in relation to self-incompatibility and 
capacity for autonomous self-pollination in the perennial herb *Arabis alpina*. 
Annals of Botany 127:737–747.

Petrén H., T. G. Köllner and R. R. Junker. 2023a. Quantifying chemodiversity 
considering biochemical and structural properties of compounds with the 
R package *chemodiv*. New Phytologist 237: 2478-2492.

Petrén H., R. A. Anaia, K. S. Aragam, A. Bräutigam, S. Eckert, R. Heinen,
R. Jakobs, L. Ojeda, M. Popp, R. Sasidharan, J-P. Schnitzler, A. Steppuhn,
F. Thon, S. Tschikin, S. B. Unsicker, N. M. van Dam, W. W. Weisser,
M. J. Wittmann, S. Yepes, D. Ziaja, C. Müller, R. R. Junker. 2023b.
Understanding the phytochemical diversity of plants: Quantification,
variation and ecological function. bioRxiv doi: 10.1101/2023.03.23.533415.