Manual

Introduction

The MSj package aims at providing an API to the most common file format in mass spectrometry. The following file formats are currently supported:

mzXML

Public elements

The functions below are exported:

Data types

The main data type of the package is the abstract type MSj.MScontainer.

Mass spectrometry scans are stored in the following structures, inspired from the mzXML format, which is a subtype of MSj.MScontainer.

struct MSscan <: MScontainer
    num::Int                          # num
    rt::Float64                       # retentionTime
    tic::Float64                      # totIonCurrent
    mz::Vector{Float64}               # m/z
    int::Vector{Float64}              # intensity
    level::Int                        # msLevel
    basePeakMz::Float64               # basePeakMz
    basePeakIntensity::Float64        # basePeakIntensity
    precursor::Float64                # precursorMz
    polarity::String                  # polarity
    activationMethod::String          # activationMethod
    collisionEnergy::Float64          # collisionEnergy
end

Another subtype, MSj.Chromatogram, is used to store the retention time, the ionic current and the maximum value of the ion current.

struct Chromatogram  <: MScontainer
    rt::Vector{Float64}               # araay of retention times
    ic::Vector{Float64}               # array of ion current
    maxic::Float64                    # maximum ion current (used in plotting normalization)
end

Combination of mass spectra requires another subtype of MSj.MScontainer called MSj.MSscans (notice the ending s).

struct MSscans  <: MScontainer
    num::Vector{Int}                  # num
    rt::Vector{Float64}               # retentionTime
    tic::Float64                      # totIonCurrent
    mz::Vector{Float64}               # m/z
    int::Vector{Float64}              # intensity
    level::Vector{Int}                # msLevel
    basePeakMz::Float64               # basePeakMz
    basePeakIntensity::Float64        # basePeakIntensity
    precursor::Vector{Float64}        # precursorMz
    polarity::Vector{String}          # polarity
    activationMethod::Vector{String}  # activationMethod
    collisionEnergy::Vector{Float64}  # collisionEnergy
    s::Vector{Float64}                # variance
end

The MSj.MSscans structure is very similar to the MSj.MSscan one, except that the fields num, rt, precursor, polarity, activationMethod and collisionEnergy are vectors. The idea is to keep track of the history of the operations that have led to this result. For example, if a MSscans element is the result of the addition of two individual scans such as scans[1] + scans[2], then the numfield of resulting MSscans is an array [1, 2]. The same applies to the retention time, precursor m/z, polarity, activation method and collision energy fields.

Information

The info public function reads the content of a file, but without loading the mass spectrometry data, and returns a Vector{String}containing the number of scans, scans level and for MS/MS data, the precursor m/z, the activation method and energy. Additional information may be gained by setting verbose = true.

info(filename)
4-element Array{String,1}:
 "51 scans"
 "MS1+"
 "MS2+ 1255.5  CID(CE=18)"
 "MS3+ 902.33  PQD(CE=35)"

Importing data

When loading a file containing more than a single acquisition, the individual mass spectrometry scans are pushed into an array of MSj.MSscan. The individual scans may be retrieve from the array the usual way:

julia> scans = load("filename")
51-element Array{MSj.MSscan,1}:
 MSj.MSscan(1, 0.1384, 5.08195e6, [140.083, 140.167, 140.25, 140.333, 140.417, 140.5, 140.583, 140.667, 140.75, 140.833  …  1999.25, 1999.33, 1999.42, ....)
...

julia> scans[1]
MSj.MSscan(1, 0.1384, 5.08195e6, [140.083, 140.167, 140.25, 140.333, 140.417, 140.5, 140.583, 140.667, 140.75, 140.833  …  1999.25, 1999.33, 1999.42, ....)

As mentioned above, chromatograms may be retrieved from a file and imported in MSj.Chromatogram :

julia> chromatogram("filename")
MSj.Chromatogram([0.1384, 0.7307, 2.1379, 3.7578, 4.3442, 5.7689], [5.08195e6, 9727.2, 11.3032, 4.8084e6, 12203.5, 4.84455], 5.08195e6)

The function MSj.retention_time reads the retention time of an input file and returns a Vector{Float64}containing the time in seconds.

julia> MSj.retention_time("filename")
51-element Array{Float64,1}:
  0.1384
  0.7307
  2.1379
....MSj.FilterType

Exporting data

Note

This feature is currently under development.

Combining and filtering data

Average

The average returns the average of the mass spectra directly from a Vector{MSscan} after Importing data data or directly from the filename.

julia> average("filename")
MSj.MSscans([1, 2, 3 ....

julia> scans = load("filename")
51-element Array{MSj.MSscan,1}:
 MSj.MSscan(1, 0.1384, 5.08195e6, [140.083, 140.167, 140.25, 140.333, 140.417, 140.5, 140.583, 140.667, 140.75, 140.833  …  1999.25, 1999.33, 1999.42, ....)
...

julia> average(scans)
MSj.MSscans([1, 2, 3 ....

Operating on files takes more time than working on Vector{MSscan} but may be useful to reduce the memory load.

Without any argument the average function averages the entire content of the data and the chromatogram function operates on also on the entire data.

Filtering

The average and chromatogram functions may takes arguments to select specific fields of interest within the data and operate on them. The argument belongs to the MSj.FilterType. Their properties are listed below:

FilterType	Description	Arguments	Specificity
MSj.Scan	Scan num	Int, Vector{Int}	average, chromatogram
MSj.Level	MS level	Int, Vector{Int}	average, chromatogram
MSj.Polarity	Polarity	String, Vector{String}	average, chromatogram
MSj.Activation_Method	Activation method	String, Vector{String}	average, chromatogram
MSj.Activation_Energy	Activation energy	Real, Vector{Real}	average, chromatogram
MSj.Precursor	Precursor m/z	Real, Vector{Real}	average, chromatogram
MSj.RT	Retention time	Real, Vector{Real}, Vector{Vector{Real}}	average
MSj.IC	Ion current	Vector{Real}	average

Note

The filtering function goes first through all the arguments and setup an array of scan num that matches the conditions. Then it uses this array to calculate the average mass spectrum. So this procedure needs two passes through the data, which is not very efficient. This is a point to make better in the future.

When the argument is restricted to a single value, such as MSj.Scan(1), filtering is performed on that specific value. If the argument is a vector then filtering involves all the values within the range. Filtering on MSj.scan([1,10]) means that the result will be obtained for scans ranging from 1 to 10. The same applies for all FilterType with the exception of MSj.∆MZ, for which the first value of the vector represents the mz and the second value represents the spread ∆mz, so that filtering is operated for all mz value in the range [m/z - ∆mz , m/z + ∆mz]. The MSj.RT type may take a vector or vectors as argument, such `MSj.RT([ [1,10], [20, 30] ]). In that case, mass spectra will be averaged in [1,10] and [20,30] range.

These filters may be combined together if necessary. For example, the input below returns the average mass spectrum for:

the MS2 scans (level = 2),
precursor m/z 1255.5,
upon CID activation conditions
with an activation energy of 18
and for retention times in the 1 to 60 s range.

average("filename", MSj.Precursor(1255.5),
                    MSj.Activation_Energy(18),
                    MSj.Activation_Method("CID"),
                    MSj.Level(2),
                    MSj.RT( [1, 60] ),
                    )

Several filter types may also be combined for chromatograms:

chromatogram("filename", MSj.Precursor(1255.5),
                         MSj.Activation_Energy(18),
                         MSj.Activation_Method("CID"),
                         MSj.Level(2),
                         )

If the condition does not match any existing data, then an ErrorException is returned with the "No matching spectra." message.

The chromatogram function has some methods using MSj.MethodType arguments:

MethodType	Description	Arguments	Remark
MSj.TIC	Total ion current	None	Default
MSj.BasePeak	Base peak intensity	None
MSj.MZ	m/z range	Vector{Real}
MSj.∆MZ	m/z ± ∆mz	Vector{Real}

These types control the way chromatograms are calculated: either using the total ionic current, the base peak intensity or using a m/z range. The method argument of the MSj.chromatogramfunction is set to MSj.TIC() by default. This setting may be overruled by setting the method to desired value:

chromatogram("filename", method = MSj.BasePeak())
chromatogram("filename", method = MSj.MZ( [257, 259] ) ) 
chromatogram("filename", method = MSj.∆MZ( [258, 1] ) )

Extracting subsets

The extract returns a Vector of MSscanfrom either a file of from a Vector{MSscan} following a load command, which corresponds to the filter conditions. See the Filtering part above.

sub_set = extract("filename")                       # extracting without any conditions returns a vector identical to the output 
sub_set = extract("filename", MSj.Level(2) )        # extract MS/MS spectra
scans = load("test.mzxml")                          # load mass spectra
sub_set = extract(scans)                            # extract a sub_set without conditions returns the original data

Processing

Smooth

The smooth function is public and applies on MSscanor MSscans objects, with an optional method argument set to MSj.SG(5, 9, 0). Smoothing is performed on the int field using the Savinsky and Golay. The first argument is the order (5 by default), the second is the number of points (default 9) and the last, is the derivative level (0). The function returns an MScontainer type identical to the input.

Base line correction

Base line correction is performed using the baseline_correction function. This function as two methods and operates either on MScontainer or on Array of MSscan such as obtained after importing data.

baseline_correction(scans)
baseline_correction(scans, method = MSj.IPSA(51, 100))

The method argument allows choosing the algorithm.

Top Hat

This filter is based Top Hat transform used in image processing (wikipedia, Sauve et al. (2004). The region onto which the operation is performed is set using the regionfield of the MSj.TopHat. This filter removes every structure from the input which are smaller in size than the structuring element. Usually a region of 100 points is enough.This filter is fast and works quite well on large and complex backgrounds.

Iterative polynomial smoothing algorithm (IPSA)

The default algorithm is the IPSA for iterative polynomial smoothing algorithm (Wang et al. (2017). This iterative algorithm use a zero ordre Savinsly and Golay smoothing to estimate a background. Then a new input, constructed by taking the minimum of either the original spectrum or the background, is smooth again. The process is repeated until the maximum iteration is reached or when the background does not change much. The termination criteria has been changed from the original paper.

Locally weighted error sum of squares regression (LOESS)

The LOESS family of algorithm is based on non-parametric linear local regression where the regression is weighted to reduced the influence of more distant data. We use here the iterative robust estimation procedure where the weights are updated with a bisquare function of the median of the residuals. This algorithm takes the number of iteration to be performed. Usually 3 iteration is enough. This algorithm is slow and is not recommended. The implementation will be improved in future versions.

Peak picking

Pick-picking is performed using the public centroid function. It operates on MSscanor MSscanstype of data and return a similar type. It takes a method argument, set by default to the Signal to Noise Analysis method: MSj.SNRA.

centroid(scan)

Signal to Noise Ratio Analysis (SNRA)

Signal to noise ratio analysis is a very general approach, which relies on the definition of noise. Here, we use TopHat filter to define the noise. Then the signal to noise ratio is calculated. Peaks are found by searching for a local maximum for which the signal to noise ratio is above the threshold. By defaults the MSj.SNRA uses a threshold = 1.0 and a structuring element of 100 points.

centroid(scan, method = MSj.SNRA(1., 100)

Threshold base peak detection algorithm (TBPD)

The TBPD method identifies features based on their similarity (as described by the Pearson correlation coefficient) with a template peak. By default the MSj.TBPD method type uses a Gaussian function, with 1000 mass resolving power and a threshold level set to 0.2% as :

centroid(scan, method = MSj.TBPD(:gauss, 1000, 0.2)

Two other shape functions are available:

:loretz which uses a Cauchy-Lorentz function and
:voigt which implements a pseudo-voigt profile (Ida et al., J. Appl. Cryst. (2000), Wikipedia)

The :lorentz profile fits better Fourrier Transform mass spectra. The :voigt shape is the result of the convolution of gaussi and Cauchy-Lorentz shape.

centroid(scan, method = MSj.TBPD(:lorentz, 1000., 0.1)
centroid(scan, method = MSj.TBPD(:voight,  1000., 0.1)

Properties calculations

From a molecular formula, it is possible to :

calculate the molecular masses
calculate the isotopic distributions
simulate a mass spectrum providing and isotopic distribution and a peak width.

Parsing a chemical formula

The private function MSj.formula takes an input string representing a chemical formula, such as "CH4", counts the number of atoms in the formula and returns a dictionary.

MSj.formula("CH3Br")
Dict("Br" => 1,"C" => 1,"H" => 3)
Dict{String,Int64} with 3 entries:
  "Br" => 1
  "C"  => 1
  "H"  => 3

In the previous example, the dictionary had three elements corresponding to the three atomic species found in the "CH3Br" formula. The "Br" key has the value 1, the "C" key has also 1 and the "H" key has a value equals to 3. The output of this function will be used by all the other functions, which will calculate properties from it. The MSj.formula function accepts stoichiometric regular molecular formulas as well as more developed forms. For hexane the following entries "C6H14", "CH3CH2CH2CH2CH2CH3" or "CH3(CH2)4CH3" are equivalents.

MSj.formula("C2H14");
Dict("C" => 2,"H" => 14)
MSj.formula("CH3CH2CH2CH2CH2CH3");
Dict("C" => 2,"H" => 14)
MSj.formula("CH3(CH2)4CH3");
Dict("C" => 2,"H" => 14)

The indices found after a parenthesis are used to multiply the elements inside the parenthesis. If no indices are found after ")", then the group is not multiply. The parenthesis are also used to specify an isotope, for example, consider ethane isotopically labelled with carbon 13:

MSj.formula("CH3(13C)H3");
Dict("C" => 1,"13C" => 1,"H" => 6)

These expressions are equivalent: "CH3(13C)H3", "CH3(13CH3)", "C(13C)H6". The following isotopes are recognized by MSj.formula:

1H for ^1H_1 (protium)
2H for ${}^2H_{2}$ (deuterium)
D equivalent to 2H
12C for 12C6
13C for 13C6
14N for 14N7
15N for 15N7
16O for 16O8
18O for 18O8
32S for 32S16
34S for 34S16

The elements are stored in a dictionary called MSj.Elements. Each key of the MSj.Elements points to an Array of MSj.Isotope, which is a structure used to stored the different properties of the isotopes:

struct Isotope
    m::Float64           # mass
    f::Float64           # natural frequency
    logf::Float64        # logarythm of the natural frequency
    Z::Int               # atomic number
    A::Int               # mass number
    active::Bool         # is radioactive
end

The isotopes are sorted by natural frequency. Hence, for instance, the first sulfur isotope is 32S16 with a natural frenquency of 0.995 followed by 34S16 with 0.043, etc.

julia> E = MSj.Elements["S"]
4-element Array{MSj.Isotope,1}:
 MSj.Isotope(31.9720711741, 0.9498500119990401, -0.051451188958515866, 16, 32, false)
 MSj.Isotope(33.96786703, 0.04252059835213182, -3.157766653355949, 16, 34, false)
 MSj.Isotope(32.9714589101, 0.00751939844812415, -4.890269137820559, 16, 33, false)
 MSj.Isotope(35.9670812, 0.00010999120070394368, -9.115110188972029, 16, 36, false)

The first element of the array, is the most naturally abundant isotope. The properties of the isotopes may be accessed by:

	julia> E = MSj.Elements["S"];
julia> E[1]           # returns the most abundant isotope of sulfur
MSj.Isotope(31.9720711741, 0.9498500119990401, -0.051451188958515866, 16, 32, false)
julia> E[2].f       # returns natural abundance of the second most abundant istotope of sulfur.
0.04252059835213182

Molecular masses

The public function masses, return a dictionary with three keys "Monoisotopic", "Average" and "Nominal", defined as:

Mass	Element	Molecule or ion
Nominal	Mass number of the most abundant stable isotope	Sum of the nominal masses of the most abundant stable isotopes
Monoistopic	Atomic mass of the most abundant stable isotope	Sum of the atomic masses of the most abundant stable isotopes
Average	Mass average function of the relative abundances	Sum of the average atomic weights of the constituents

The masses for hexane are obtained by:

julia> m = masses("CH3(CH2)4CH3")
Dict("C" => 6,"H" => 14)
Dict{String,Float64} with 3 entries:
  "Monoisotopic" => 86.1096
  "Average"      => 86.178
  "Nominal"      => 86.0

And my be recovered by:

julia> m["Average"]
86.178
julia> m["Monoisotopic"]
86.10955045178
julia> m["Nominal"]
86.0

Isotopic distributions

The public function isotopic_distribution calculates the isotopic distribution of the given formula. The function takes the following arguments:

formula: String
target probability: Real number
charge: optional argument Int, by default = +1
tau: optional Real number set by default to 0.1
the elements dictionary: set by default to MSj.Elements.

The calculations is based on the implementation of the isospec algorithm. Briefly, the algorithm search for the small set of isotopologues for which the total abundance is equal to the target probability. The calculation return a vector with all the configurations found. The first column gives the masses, the second column the probabilities of the configurations, and the following columns gives the configurations, such as:

julia> I = isotopic_distribution("C6H14", 0.9999, charge = +1)
Dict("C" => 6,"H" => 14)
5×6 Array{Union{Float64, Int64, String},2}:
   "Masses"   "Probability"   "12C"   "13C"    "1H"   "2H"
 86.1096     0.935476        6       0       14      0
 87.1129     0.0612122       5       1       14      0 
 88.1163     0.00166891      4       2       14      0 
 87.1158     0.00151559      6       0       13      1

The resulting array is intend to be readable easily, and thus the first line contains the descriptors or the columns. It may also be easily exported to a delimited file.

Simulated mass spectra

The result of an isotopic distribution calculation may be convoluted by a experimental peak shape to produce a simulated mass spectrum. Such an operatio, is achieved by the simulate function. The function takes the following arguments:

I: the Arrayresulting from the isotopic_distribution calculation
∆mz: the width in Dalton of the peak shape
model: the model used for the peak shape. By default it is set to model=:gauss, but may be model=:lorentz or model=:voigt.
Npoints: an Int value for the number of points in the mass spectrum. By default set to Npoints=1000.

julia> I = isotopic_distribution("C6H14", 0.9999, charge = +1)
Dict("C" => 6,"H" => 14)
julia> sim = simulate(I, 0.05, model= :lorentz);
julia> Using Plots
julia> plot(sim)
 ```
See [## Plotting](@ref) for more information on how to plot mass spectra.


# Plotting
Plotting facilities are available as a submodule to the `MSj` package.  The [`MSj.plots`](@ref) module relies on the [RecipesBase package](https://github.com/JuliaPlots/RecipesBase.jl), which allows writing recipes to plot users' data types. Hence, recipes have been created for `MSscan`, `Msscans` and `Chromatogram`:

julia plot(scans[1], method = :relative))

```

By default plotting is made in relative intensities, which may be changed by setting method to :absolute.