mpactr requires 2 files as imput: a feature table and metadata file. Both are expected to be comma separated files (.csv).
To import these data into R, use the mpactr function
import_data()
, which has the arguments:
peak_table_file_path
and meta_data_file_path
.
This tutorial will show you examples with data from the original mpact
program, found on GitHub.
This dataset contain 38 samples for biological groups solvent blanks,
media blanks, Streptomyces sp. PTY08712 grown with (250um_Ce) and
without (0um_Ce) rare earth element cerium. For more information about
the experiments conducted see Samples,
Puckett, and Balunas 2023.
The original program has sample metadata split across two files: the
sample list and metadata. mpactr accepts a single file, so we need to
combine these in one prior to import with
import_data()
.
Loading the sample list…
samplelist <- read_csv(example_path("PTY087I2_samplelist.csv"))
#> Rows: 38 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Injection, MS method, LC method, Sample_Code
#> dbl (1): Injection volume
#> lgl (2): File Text, Sample_Notes
#> time (1): Vial_Position
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Loading the metadata list…
metadata <- read_csv(example_path("PTY087I2_extractmetadata.csv"))
#> Rows: 10 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Sample_Code, Organism, Biological_Group, Extract_Notes
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Our sample list contains additional blank samples that are not in the feature table, and therefore should be removed prior to import.
samples <- read_csv(example_path("PTY087I2_dataset.csv"), skip = 2) %>%
colnames() %>%
str_subset(., "200826")
#> Rows: 4956 Columns: 33
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Compound
#> dbl (32): m/z, Retention time (min), 200826_blank1_r1, 200826_blank1_r2, 200...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
meta_data <- samplelist %>%
left_join(metadata, by = "Sample_Code") %>%
filter(Injection %in% samples)
Now we can import the data. We will provide the url for the
peak_table
, and our reformatted meta_data object. This peak
table was exported from Progenesis, so we will set the
format
argument to Progenesis.
data <- import_data(peak_table = example_path("PTY087I2_dataset.csv"),
meta_data = meta_data,
format = "Progenesis"
)
This will create an R6 class object, which will store both the peak table and metadata.
Calling the new mpactr object will print the current peak table in the terminal:
data
#> Compound mz rt 200826_blank1_r1 200826_blank1_r2
#> <char> <num> <num> <num> <num>
#> 1: 0.80_418.1451n 419.1521 0.8027667 0.00000 0.000000
#> 2: 0.81_210.0803m/z 210.0803 0.8099167 0.00000 0.000000
#> 3: 4.11_444.1061n 889.2190 4.1109833 0.00000 0.000000
#> 4: 4.11_400.0799n 401.0872 4.1109833 0.00000 0.000000
#> 5: 0.80_627.2171n 650.2063 0.8027667 0.00000 0.000000
#> ---
#> 4952: 8.95_740.1659m/z 740.1659 8.9477667 50.86249 0.000000
#> 4953: 9.94_801.6828m/z 801.6828 9.9434667 11.08239 1.229457
#> 4954: 8.45_702.2132m/z 702.2132 8.4499000 257.47905 38.155170
#> 4955: 8.35_664.5170m/z 664.5170 8.3506167 284.44529 291.579543
#> 4956: 8.88_467.1028m/z 467.1028 8.8770500 2150.74323 0.000000
#> 200826_blank1_r3 200826_PTY087I2_0umce1_r1 200826_PTY087I2_0umce1_r2
#> <num> <num> <num>
#> 1: 0.00000000 3.020244e+05 3.056620e+05
#> 2: 0.00000000 1.655825e+05 1.668336e+05
#> 3: 0.00000000 3.401869e+04 3.530162e+04
#> 4: 0.00000000 4.358187e+04 4.370543e+04
#> 5: 0.00000000 4.332639e+04 4.388268e+04
#> ---
#> 4952: 0.00000000 4.148686e+01 7.871277e+00
#> 4953: 9.07128542 9.266055e-02 1.530647e-01
#> 4954: 0.00000000 1.732118e+03 0.000000e+00
#> 4955: 0.01844371 3.013042e+01 3.180264e+01
#> 4956: 0.00000000 0.000000e+00 1.028675e+03
#> 200826_PTY087I2_0umce1_r3 200826_PTY087I2_0umce2_r1
#> <num> <num>
#> 1: 3.101702e+05 2.978810e+05
#> 2: 1.675075e+05 1.625522e+05
#> 3: 3.543251e+04 2.000822e+04
#> 4: 4.357772e+04 3.155634e+04
#> 5: 4.557824e+04 4.335971e+04
#> ---
#> 4952: 0.000000e+00 5.231151e-01
#> 4953: 3.667251e-01 0.000000e+00
#> 4954: 0.000000e+00 1.925907e+03
#> 4955: 3.135295e+01 3.848152e+01
#> 4956: 0.000000e+00 0.000000e+00
#> 200826_PTY087I2_0umce2_r2 200826_PTY087I2_0umce2_r3
#> <num> <num>
#> 1: 3.035890e+05 302993.80270
#> 2: 1.647463e+05 163903.73400
#> 3: 2.005072e+04 20880.29986
#> 4: 3.127270e+04 32045.00687
#> 5: 4.480789e+04 44925.54654
#> ---
#> 4952: 0.000000e+00 0.00000
#> 4953: 4.336638e+00 16.76814
#> 4954: 0.000000e+00 0.00000
#> 4955: 3.330928e+01 39.86223
#> 4956: 1.324812e+03 0.00000
#> 200826_PTY087I2_0umce3_r1 200826_PTY087I2_0umce3_r2
#> <num> <num>
#> 1: 302974.58390 303741.67880
#> 2: 163537.75040 163110.51990
#> 3: 22257.61067 23254.43336
#> 4: 32885.51773 34089.23273
#> 5: 44175.78590 44779.28226
#> ---
#> 4952: 0.00000 0.00000
#> 4953: 22.20122 27.34200
#> 4954: 3572.79179 0.00000
#> 4955: 40.88535 31.80185
#> 4956: 0.00000 2695.52447
#> 200826_PTY087I2_0umce3_r3 200826_PTY087I2_250umce1_r1
#> <num> <num>
#> 1: 305307.58880 304461.58900
#> 2: 162538.71820 159919.35920
#> 3: 23551.62571 115011.92150
#> 4: 34572.55101 86989.85992
#> 5: 45432.68848 46936.50804
#> ---
#> 4952: 0.00000 10.90783
#> 4953: 18.40502 19.33673
#> 4954: 0.00000 2909.45195
#> 4955: 37.87828 36.48544
#> 4956: 0.00000 0.00000
#> 200826_PTY087I2_250umce1_r2 200826_PTY087I2_250umce1_r3
#> <num> <num>
#> 1: 3.059096e+05 301102.32680
#> 2: 1.622441e+05 160509.21490
#> 3: 1.137904e+05 111088.30390
#> 4: 8.660144e+04 84092.94428
#> 5: 4.671744e+04 44639.13434
#> ---
#> 4952: 0.000000e+00 0.00000
#> 4953: 8.184311e+00 21.90738
#> 4954: 0.000000e+00 0.00000
#> 4955: 3.628852e+01 29.76471
#> 4956: 2.857433e+03 0.00000
#> 200826_PTY087I2_250umce2_r1 200826_PTY087I2_250umce2_r2
#> <num> <num>
#> 1: 2.950731e+05 294507.85260
#> 2: 1.569863e+05 157018.74160
#> 3: 9.171678e+04 93573.47803
#> 4: 7.609289e+04 76935.45726
#> 5: 4.381261e+04 44053.75013
#> ---
#> 4952: 6.760895e+00 0.00000
#> 4953: 7.221771e+00 15.49070
#> 4954: 3.327218e+03 0.00000
#> 4955: 2.518465e+01 31.12184
#> 4956: 0.000000e+00 3328.19867
#> 200826_PTY087I2_250umce2_r3 200826_PTY087I2_250umce3_r1
#> <num> <num>
#> 1: 2.929342e+05 2.965014e+05
#> 2: 1.581785e+05 1.587843e+05
#> 3: 9.163425e+04 1.096399e+05
#> 4: 7.578763e+04 8.364374e+04
#> 5: 4.269494e+04 4.389627e+04
#> ---
#> 4952: 0.000000e+00 1.476297e+01
#> 4953: 9.325017e+00 5.648035e-03
#> 4954: 2.368871e-02 1.814666e+03
#> 4955: 3.390672e+01 3.206222e+01
#> 4956: 0.000000e+00 0.000000e+00
#> 200826_PTY087I2_250umce3_r2 200826_PTY087I2_250umce3_r3 200826_media1_r1
#> <num> <num> <num>
#> 1: 285505.55760 2.974424e+05 2.554827e+05
#> 2: 156528.89670 1.580228e+05 1.705910e+05
#> 3: 105935.69660 1.093869e+05 0.000000e+00
#> 4: 82556.23968 8.342072e+04 0.000000e+00
#> 5: 40825.92921 4.428666e+04 3.381575e+04
#> ---
#> 4952: 0.00000 0.000000e+00 5.259717e+01
#> 4953: 0.00000 1.223522e+00 4.605544e+00
#> 4954: 0.00000 0.000000e+00 1.498701e+03
#> 4955: 31.94794 4.173415e+01 7.619112e+03
#> 4956: 2671.52999 0.000000e+00 2.386338e+02
#> 200826_media1_r2 200826_media1_r3 200826_media2_r1 200826_media2_r2
#> <num> <num> <num> <num>
#> 1: 2.599233e+05 256021.50590 256086.16490 2.547825e+05
#> 2: 1.726717e+05 170716.05390 168874.36980 1.676367e+05
#> 3: 0.000000e+00 0.00000 0.00000 0.000000e+00
#> 4: 1.431221e-02 0.00000 0.00000 0.000000e+00
#> 5: 3.397754e+04 33169.47338 34372.98871 3.400355e+04
#> ---
#> 4952: 0.000000e+00 83.52441 0.00000 0.000000e+00
#> 4953: 1.084082e+01 2.18883 8.67406 3.569775e+00
#> 4954: 0.000000e+00 0.00000 18.25012 0.000000e+00
#> 4955: 8.962027e+00 48.01832 13.24781 1.311910e+01
#> 4956: 2.095771e+02 0.00000 0.00000 2.203997e+02
#> 200826_media2_r3 200826_media3_r1 200826_media3_r2 200826_media3_r3
#> <num> <num> <num> <num>
#> 1: 2.565244e+05 2.601417e+05 2.526247e+05 2.582542e+05
#> 2: 1.692963e+05 1.695131e+05 1.672526e+05 1.707665e+05
#> 3: 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
#> 4: 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
#> 5: 3.396147e+04 3.518284e+04 3.317613e+04 3.422295e+04
#> ---
#> 4952: 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
#> 4953: 3.475497e+00 2.908507e+00 7.279138e+00 1.909397e+00
#> 4954: 0.000000e+00 1.539513e+03 3.443881e+01 6.352884e+01
#> 4955: 1.589438e+01 2.351233e+01 1.777778e+01 2.357445e+01
#> 4956: 0.000000e+00 0.000000e+00 7.790370e+02 0.000000e+00
#> kmd
#> <num>
#> 1: 0.1520828
#> 2: 0.0803421
#> 3: 0.2189760
#> 4: 0.0872034
#> 5: 0.2063013
#> ---
#> 4952: 0.1658522
#> 4953: 0.6828202
#> 4954: 0.2131724
#> 4955: 0.5170432
#> 4956: 0.1028370
You can extract the peak table or metadata at any point with
get_raw_data()
, get_peak_table()
and
get_meta_data()
functions. Both functions will return a
data.table
object with the corresponding information.
The raw peak table is the unfiltered peak table used as input to
mpactr. To extract the raw input peak table, use the function
get_raw_data()
.
get_raw_data(data)[1:5, 1:8]
#> Compound mz rt 200826_blank1_r1 200826_blank1_r2
#> <char> <num> <num> <num> <num>
#> 1: 0.80_418.1451n 419.1521 0.8027667 0 0
#> 2: 0.81_210.0803m/z 210.0803 0.8099167 0 0
#> 3: 4.11_444.1061n 889.2190 4.1109833 0 0
#> 4: 4.11_400.0799n 401.0872 4.1109833 0 0
#> 5: 0.80_627.2171n 650.2063 0.8027667 0 0
#> 200826_blank1_r3 200826_PTY087I2_0umce1_r1 200826_PTY087I2_0umce1_r2
#> <num> <num> <num>
#> 1: 0 302024.40 305662.05
#> 2: 0 165582.52 166833.63
#> 3: 0 34018.69 35301.62
#> 4: 0 43581.87 43705.43
#> 5: 0 43326.39 43882.68
The raw peak table will not change as filters are applied to the
data. If you want to extract the filtered peak table, with filters that
have been applied, use get_peak_table()
:
get_peak_table(data)[1:5, 1:8]
#> Compound mz rt 200826_blank1_r1 200826_blank1_r2
#> <char> <num> <num> <num> <num>
#> 1: 0.80_418.1451n 419.1521 0.8027667 0 0
#> 2: 0.81_210.0803m/z 210.0803 0.8099167 0 0
#> 3: 4.11_444.1061n 889.2190 4.1109833 0 0
#> 4: 4.11_400.0799n 401.0872 4.1109833 0 0
#> 5: 0.80_627.2171n 650.2063 0.8027667 0 0
#> 200826_blank1_r3 200826_PTY087I2_0umce1_r1 200826_PTY087I2_0umce1_r2
#> <num> <num> <num>
#> 1: 0 302024.40 305662.05
#> 2: 0 165582.52 166833.63
#> 3: 0 34018.69 35301.62
#> 4: 0 43581.87 43705.43
#> 5: 0 43326.39 43882.68
Metadata can be accessed with get_meta_data()
:
get_meta_data(data)[1:5, ]
#> Injection File Text Sample_Notes MS method
#> <char> <lgcl> <lgcl> <char>
#> 1: 200826_blank1_r1 NA NA RMS20181105_Balunas_11min_mse
#> 2: 200826_blank1_r2 NA NA RMS20181105_Balunas_11min_mse
#> 3: 200826_blank1_r3 NA NA RMS20181105_Balunas_11min_mse
#> 4: 200826_media1_r1 NA NA RMS20181105_Balunas_11min_mse
#> 5: 200826_media1_r2 NA NA RMS20181105_Balunas_11min_mse
#> LC method Vial_Position
#> <char> <hms>
#> 1: jlb20170623_Balunas_11minRogermethod_450uLmin_5to98_col3_noPDA 01:16:00
#> 2: jlb20170623_Balunas_11minRogermethod_450uLmin_5to98_col3_noPDA 01:16:00
#> 3: jlb20170623_Balunas_11minRogermethod_450uLmin_5to98_col3_noPDA 01:16:00
#> 4: jlb20170623_Balunas_11minRogermethod_450uLmin_5to98_col3_noPDA 01:01:00
#> 5: jlb20170623_Balunas_11minRogermethod_450uLmin_5to98_col3_noPDA 01:01:00
#> Injection volume Sample_Code Organism Biological_Group Extract_Notes
#> <num> <char> <char> <char> <char>
#> 1: 10 Blank 50% MeOH Blanks 50% MeOH
#> 2: 10 Blank 50% MeOH Blanks 50% MeOH
#> 3: 10 Blank 50% MeOH Blanks 50% MeOH
#> 4: 10 MB1109 GYM_Media Media GYM_Media
#> 5: 10 MB1109 GYM_Media Media GYM_Media
mpactr is built on an R6 class-system, meaning it operates on reference semantics in which data is updated in-place. Compared to a shallow copy, where only data pointers are copied, or a deep copy, where the entire data object is copied in memory, any changes to the original data object, regardless if they are assigned to a new object, result in changes to the original data object. We can see this below.
Where the raw data object has 4956 ions in the feature table:
data2 <- import_data(peak_table = example_path("PTY087I2_dataset.csv"),
meta_data = meta_data,
format = "Progenesis"
)
get_peak_table(data2)[, 1:5]
#> Compound mz rt 200826_blank1_r1 200826_blank1_r2
#> <char> <num> <num> <num> <num>
#> 1: 0.80_418.1451n 419.1521 0.8027667 0.00000 0.000000
#> 2: 0.81_210.0803m/z 210.0803 0.8099167 0.00000 0.000000
#> 3: 4.11_444.1061n 889.2190 4.1109833 0.00000 0.000000
#> 4: 4.11_400.0799n 401.0872 4.1109833 0.00000 0.000000
#> 5: 0.80_627.2171n 650.2063 0.8027667 0.00000 0.000000
#> ---
#> 4952: 8.95_740.1659m/z 740.1659 8.9477667 50.86249 0.000000
#> 4953: 9.94_801.6828m/z 801.6828 9.9434667 11.08239 1.229457
#> 4954: 8.45_702.2132m/z 702.2132 8.4499000 257.47905 38.155170
#> 4955: 8.35_664.5170m/z 664.5170 8.3506167 284.44529 291.579543
#> 4956: 8.88_467.1028m/z 467.1028 8.8770500 2150.74323 0.000000
We can run the filter_mispicked_ions
filter, with
default setting copy_object = FALSE
(operates on reference
semantics).
data2_mispicked <- filter_mispicked_ions(data2,
ringwin = 0.5,
isowin = 0.01, trwin = 0.005,
max_iso_shift = 3, merge_peaks = TRUE,
merge_method = "sum",
copy_object = FALSE
)
#> ℹ Checking 4956 peaks for mispicked peaks.
#> ℹ Argument merge_peaks is: TRUE. Merging mispicked peaks with method sum.
#> ✔ 270 ions failed the mispicked filter, 4686 ions remain.
get_peak_table(data2_mispicked)[, 1:5]
#> Key: <Compound, mz, kmd, rt>
#> Compound mz kmd rt
#> <char> <num> <num> <num>
#> 1: 0.03_102.1549m/z 102.15495 0.15494700 0.03431667
#> 2: 0.04_113.9641m/z 113.96412 0.96411580 0.04146667
#> 3: 0.06_182.9860m/z 182.98601 0.98600990 0.05573333
#> 4: 0.06_235.9635m/z 235.96353 0.96353360 0.05573333
#> 5: 0.09_68.9936m/z 68.99356 0.99355728 0.09145000
#> ---
#> 4682: 9.99_463.1718m/z 463.17181 0.17180700 9.98643333
#> 4683: 9.99_480.7214m/z 480.72136 0.72135840 9.98643333
#> 4684: 9.99_87.5099m/z 87.50993 0.50992700 9.99358333
#> 4685: 9.99_92.9671m/z 92.96709 0.96708947 9.98643333
#> 4686: 9.99_95.0612m/z 95.06120 0.06119876 9.99358333
#> 200826_PTY087I2_0umce1_r1
#> <num>
#> 1: 0.00000
#> 2: 177.21710
#> 3: 5632.81768
#> 4: 17.36759
#> 5: 13.93727
#> ---
#> 4682: 85.71503
#> 4683: 0.00000
#> 4684: 26.94538
#> 4685: 46.99969
#> 4686: 31.83990
This results in 4686 ions in the feature table (above). Even though
we created an object called data2_mispicked
, the original
data2
object was also updated and now has 4686 ions in the
feature table:
get_peak_table(data2)[, 1:5]
#> Key: <Compound, mz, kmd, rt>
#> Compound mz kmd rt
#> <char> <num> <num> <num>
#> 1: 0.03_102.1549m/z 102.15495 0.15494700 0.03431667
#> 2: 0.04_113.9641m/z 113.96412 0.96411580 0.04146667
#> 3: 0.06_182.9860m/z 182.98601 0.98600990 0.05573333
#> 4: 0.06_235.9635m/z 235.96353 0.96353360 0.05573333
#> 5: 0.09_68.9936m/z 68.99356 0.99355728 0.09145000
#> ---
#> 4682: 9.99_463.1718m/z 463.17181 0.17180700 9.98643333
#> 4683: 9.99_480.7214m/z 480.72136 0.72135840 9.98643333
#> 4684: 9.99_87.5099m/z 87.50993 0.50992700 9.99358333
#> 4685: 9.99_92.9671m/z 92.96709 0.96708947 9.98643333
#> 4686: 9.99_95.0612m/z 95.06120 0.06119876 9.99358333
#> 200826_PTY087I2_0umce1_r1
#> <num>
#> 1: 0.00000
#> 2: 177.21710
#> 3: 5632.81768
#> 4: 17.36759
#> 5: 13.93727
#> ---
#> 4682: 85.71503
#> 4683: 0.00000
#> 4684: 26.94538
#> 4685: 46.99969
#> 4686: 31.83990
We recommend using the default copy_object = FALSE
as
this makes for an extremely fast and memory-efficient way to chain
mpactr filters together (see Chaining filters together
section and Reference
Semantics); however, if you would like to run the filters
individually with traditional R style objects, you can set
copy_object
to TRUE
as shown in the filter
examples.
mpactr provides filters to correct for the following issues observed during preprocessing of tandem MS/MS data:
To check for mispicked ions, use mpactr function
filter_mispicked_ions()
. This function takes an
mpactr object
as input, and checks for similar ions with
the arguments ringwin
, isowin
,
trwin
and max_iso_shift
.
Ions in the feature table are flagged as similar based on retention time and mass. Flagged ion groups are suggested to be the result of incorrect splitting of isotopic patterns during peak picking, detector saturation artifacts, or incorrect identification of multiply charged oligomers.
data_mispicked <- filter_mispicked_ions(data,
ringwin = 0.5,
isowin = 0.01, trwin = 0.005,
max_iso_shift = 3, merge_peaks = TRUE,
merge_method = "sum",
copy_object = TRUE
)
#> ℹ Checking 4956 peaks for mispicked peaks.
#> ℹ Argument merge_peaks is: TRUE. Merging mispicked peaks with method sum.
#> ✔ 270 ions failed the mispicked filter, 4686 ions remain.
Each filter reports progress of filtering, here we can see that 4956 ions were present prior to checking for mispicked ions. 270 ions were found to be similar to another ion and following merging, 4686 ions remain.
If you are interested in the groups of similar ions flagged in this
filter, you can use get_similar_ions()
. This function
returns a data.table
reporting the main ion (the ion
retained post-merging) and the ions similar to it.
head(get_similar_ions(data_mispicked))
#> main_ion similar_ions
#> <char> <list>
#> 1: 9.97_83.5172m/z 9.97_84.9592m/z
#> 2: 9.97_91.0082m/z 9.97_91.5049m/z,9.97_93.5039m/z
#> 3: 1.99_91.0545m/z 1.99_93.0679m/z
#> 4: 2.00_103.0545m/z 2.00_105.0674m/z
#> 5: 9.99_106.0043m/z 9.99_108.0234m/z
#> 6: 9.97_111.0207m/z 9.97_111.5219m/z,9.97_112.5194m/z
Removing solvent blank impurities is important for correcting for
between-sample carryover and contamination in experimental samples. You
can identify and remove these ions with mpactr’s
filter_group()
function. filter_group()
identifies ions above a relative abundance threshold
(group_threshold
) in a specific group
(group_to_remove
). To remove solvent blank impurities, set
group_to_remove
to the Biological_Group
in
your metadata file which corresponds to your solvent blank samples, here
“Blanks”.
data_blank <- filter_group(data,
group_threshold = 0.01,
group_to_remove = "Blanks", remove_ions = TRUE,
copy_object = TRUE
)
#> ℹ Parsing 4956 peaks based on the sample group: Blanks.
#> ℹ Argument remove_ions is: TRUE.Removing peaks from Blanks.
#> ✔ 1822 ions failed the Blanks filter, 3134 ions remain.
In this example, 4956 ions were present prior to the group filter. 1822 ions were found to be above the relative abundance threshold of 0.01 in “Solvent_Blank” samples, leaving 3134 ions in the peak table.
We can also use this filter to remove ions from other groups, such as
media blanks. This can be useful for experiments on cell cultures. The
example data contains samples belonging to the
Biological_Group
“Media”. These samples are from media
blanks, which are negative controls from the growth experiments
conducted in this study. We can remove features whose abundance is
greater than 1% of the largest group in media blank samples by
specifying group_to_remove
= “Media”. We recommend removing
media blank ions following all other filters so all high-quality ions
are identified (see Chaining filters together below).
data_media_blank <- filter_group(data,
group_threshold = 0.01,
group_to_remove = "Media", remove_ions = TRUE,
copy_object = TRUE
)
#> ℹ Parsing 4956 peaks based on the sample group: Media.
#> ℹ Argument remove_ions is: TRUE.Removing peaks from Media.
#> ✔ 3936 ions failed the Media filter, 1020 ions remain.
Ions whose abundance are not consisent between technical replicates
(i.e., non-reproducible) may not be reliable for analysis and
therefore should be removed from the feature table. Non-reproducible
ions are identified by mean or median coefficient of variation (CV)
between technical replicates with filter_cv()
. Note - this
filter cannot be applied to data that does not contain technical
replicates.
data_rep <- filter_cv(data,
cv_threshold = 0.2, cv_param = "median",
copy_object = TRUE
)
#> ℹ Parsing 4956 peaks for replicability across technical replicates.
#> ✔ 2179 ions failed the cv_filter filter, 2777 ions remain.
In our example dataset, 2179 ions were flagged as non-reproducible. These ions were removed, leaving 2777 ions in the feature table.
If you would like to visualize how the CV threshold performed on your
dataset, you can extract the CV calculated during
filer_cv()
using mpactr’s get_cv_data()
function, and calculate the percentage of features for plotting. You can
look at both mean and median CV as shown in the example below, or you
can filter the data by the parameter of choice.
cv <- get_cv_data(data_rep) %>%
pivot_longer(cols = c("mean_cv", "median_cv"),
names_to = "param",
values_to = "cv") %>%
nest(.by = param) %>%
mutate(
data = map(data, arrange, cv),
data = map(data, mutate, index = 0:(length(cv) - 1)),
data = map(data, mutate, index_scale = index * 100 / length(cv))
)
head(cv)
#> # A tibble: 2 × 2
#> param data
#> <chr> <list>
#> 1 mean_cv <tibble [4,956 × 4]>
#> 2 median_cv <tibble [4,956 × 4]>
The nested data are tibbles with the columns Compound, cv, index, index_scale:
#> # A tibble: 6 × 4
#> Compound cv index index_scale
#> <chr> <dbl> <int> <dbl>
#> 1 0.81_210.0803m/z 0.00636 0 0
#> 2 10.02_89.5072m/z 0.00686 1 0.0202
#> 3 10.02_110.0204m/z 0.00831 2 0.0404
#> 4 0.72_203.9632n 0.00928 3 0.0605
#> 5 0.80_418.1451n 0.0100 4 0.0807
#> 6 1.96_293.6773m/z 0.0104 5 0.101
There is one tibble for each parameter (either mean or median). We also want to calculate the percentage of features represented by the CV threshold.
cv_thresh_percent <- cv %>%
filter(param == "median_cv") %>%
unnest(cols = data) %>%
mutate(diff_cv_thresh = abs(cv - 0.2)) %>%
slice_min(diff_cv_thresh, n = 1) %>%
pull(index_scale)
cv_thresh_percent
#> [1] 56.03309
Then we can plot percentage of features by CV:
cv %>%
unnest(cols = data) %>%
mutate(param = factor(param, levels = c("mean_cv", "median_cv"),
labels = c("mean", "median"))) %>%
ggplot() +
aes(x = cv, y = index_scale, group = param, color = param) +
geom_line(linewidth = 2) +
geom_vline(xintercept = 0.2,
color = "darkgrey",
linetype = "dashed",
linewidth = 1) +
geom_hline(yintercept = cv_thresh_percent,
color = "darkgrey",
linewidth = 1) +
labs(x = "CV",
y = "Percentage of Features",
param = "Statistic") +
theme_bw() +
theme(
axis.title = element_text(size = 20),
axis.text = element_text(size = 15),
legend.position = "inside",
legend.position.inside = c(.90, .08),
legend.title = element_blank(),
legend.text = element_text(size = 15)
)
Here we can see that roughly 56% of features were below the CV threshold meaning 44% were removed at a CV threshold of 0.2.
Some mass species can be fragmented during ionization in tandem
MS/MS, creating insource ions. This can result in ions from one compound
being represented more than once in the feature table. If you would like
to remove insource ions fragments, you can do so with mpactr’s
filter_insource_ions()
. filter_insource_ions()
conducts ion deconvolution via retention time correlation matrices
within MS1 scans. Highly correlated ion groups are determined by the
cluster_threshold
parameter and filtered to remove the low
mass features. The highest mass feature is identified as the likely
precursor ion and retained in the feature table.
data_insource <- filter_insource_ions(data,
cluster_threshold = 0.95,
copy_object = TRUE
)
#> ℹ Parsing 4956 peaks for insource ions.
#> ✔ 600 ions failed the insource filter, 4356 ions remain.
600 ions were identified and removed during deconvolution of this dataset, leaving 4356 ions in the feature table.
Filters can be chained in a customizable workflow, shown below. While
filters can be chained in any order, we recommend filtering mispicked
ions, then solvent blanks, prior to filtering non-repoducible or
insource ions. This will allow for incorrectly picked peaks to be merged
and any contamination/carryover removed prior to identifying
non-reproducible and insource fragment ions. Here we also demonstrate
the removal of media blank components with the
filter_group()
function after identification of
high-quality ions.
data <- import_data(peak_table = example_path("PTY087I2_dataset.csv"),
meta_data = meta_data,
format = "Progenesis"
)
data_filtered <- filter_mispicked_ions(data, merge_method = "sum") |>
filter_group(group_to_remove = "Blanks") |>
filter_cv(cv_threshold = 0.2, cv_param = "median") |>
filter_group(group_to_remove = "Media")
#> ℹ Checking 4956 peaks for mispicked peaks.
#> ℹ Argument merge_peaks is: TRUE. Merging mispicked peaks with method sum.
#> ✔ 270 ions failed the mispicked filter, 4686 ions remain.
#> ℹ Parsing 4686 peaks based on the sample group: Blanks.
#> ℹ Argument remove_ions is: TRUE.Removing peaks from Blanks.
#> ✔ 1741 ions failed the Blanks filter, 2945 ions remain.
#> ℹ Parsing 2945 peaks for replicability across technical replicates.
#> ✔ 1143 ions failed the cv_filter filter, 1802 ions remain.
#> ℹ Parsing 1802 peaks based on the sample group: Media.
#> ℹ Argument remove_ions is: TRUE.Removing peaks from Media.
#> ✔ 1239 ions failed the Media filter, 563 ions remain.
mpactr offers mutliple ways to view a summary of data filtering.
If you are interested in viewing the passing and failing ions for a
single filter, use the filter_summary()
function. You must
specify which filter you are intested in, either “mispicked”, “group”,
“replicability”, or “insource”.
Failed ions:
head(mispicked_summary$failed_ions, 100)
#> [1] "9.97_84.9592m/z" "9.97_91.5049m/z" "9.97_93.5039m/z"
#> [4] "1.99_93.0679m/z" "2.00_105.0674m/z" "9.99_108.0234m/z"
#> [7] "9.97_111.5219m/z" "9.97_112.5194m/z" "2.32_117.0586m/z"
#> [10] "2.00_120.0817m/z" "2.00_120.1345m/z" "2.00_120.1704m/z"
#> [13] "0.77_118.1157m/z" "1.99_120.3792m/z" "2.32_132.0815m/z"
#> [16] "0.67_131.5785m/z" "1.46_132.1331m/z" "1.21_133.0896n"
#> [19] "0.84_139.0507m/z" "2.32_143.0736m/z" "2.32_144.0816m/z"
#> [22] "0.62_143.5978m/z" "0.67_147.1134m/z" "3.92_151.0634m/z"
#> [25] "4.62_165.0704m/z" "2.00_166.4370m/z" "1.99_166.1491m/z"
#> [28] "4.12_176.0628m/z" "0.62_180.6223m/z" "10.11_183.0220m/z"
#> [31] "0.94_186.5569m/z" "1.21_189.1242m/z" "0.94_190.0465m/z"
#> [34] "3.61_193.0423m/z" "0.62_195.6240m/z" "1.19_196.0381n"
#> [37] "4.78_198.0854m/z" "1.21_201.0726m/z" "1.21_202.1205m/z"
#> [40] "2.07_202.1180m/z" "3.63_207.0634m/z" "0.81_210.1507m/z"
#> [43] "0.81_210.1997m/z" "0.81_210.2631m/z" "0.81_210.4740m/z"
#> [46] "2.94_212.1208m/z" "0.65_213.0191m/z" "2.07_217.1056m/z"
#> [49] "1.91_217.1553m/z" "1.20_218.1109m/z" "1.22_221.5316m/z"
#> [52] "1.21_223.0594m/z" "1.96_227.6359m/z" "1.21_227.1005m/z"
#> [55] "0.84_226.1553m/z" "3.72_227.1760m/z" "3.63_229.0865m/z"
#> [58] "1.21_231.1367m/z" "1.92_234.1403m/z" "1.22_237.1363m/z"
#> [61] "0.63_238.1461m/z" "2.35_237.1243m/z" "1.21_244.1664m/z"
#> [64] "0.66_244.1380m/z" "1.21_247.1298m/z" "0.67_123.5768n"
#> [67] "2.22_247.1295m/z" "4.12_250.5607m/z" "0.62_250.1370m/z"
#> [70] "0.62_252.1540m/z" "2.23_250.1917m/z" "1.82_252.1519m/z"
#> [73] "0.83_253.1626n" "0.75_258.1110m/z" "3.14_258.5443m/z"
#> [76] "0.66_257.6233m/z" "0.66_258.1473m/z" "0.75_261.0120m/z"
#> [79] "1.24_260.1969m/z" "3.96_263.5542m/z" "1.17_265.0928m/z"
#> [82] "0.64_271.9292m/z" "9.97_250.9284n" "1.21_273.1650m/z"
#> [85] "1.21_274.0933m/z" "2.35_275.6568m/z" "1.17_275.6450m/z"
#> [88] "3.43_278.0424m/z" "0.64_280.6631m/z" "1.89_283.6481m/z"
#> [91] "3.99_286.0659m/z" "0.64_288.1607m/z" "1.21_287.5791m/z"
#> [94] "1.17_143.5989n" "0.83_288.2030m/z" "1.20_289.6186m/z"
#> [97] "0.83_291.1546m/z" "0.64_292.4432m/z" "10.09_294.0010m/z"
#> [100] "1.21_294.5725m/z"
Passing ions:
head(mispicked_summary$passed_ions, 100)
#> [1] "0.03_102.1549m/z" "0.04_113.9641m/z" "0.06_182.9860m/z"
#> [4] "0.06_235.9635m/z" "0.09_68.9936m/z" "0.10_102.1549m/z"
#> [7] "0.13_102.1552m/z" "0.17_213.0188m/z" "0.23_102.1549m/z"
#> [10] "0.28_691.4629m/z" "0.33_102.1550m/z" "0.35_68.9936m/z"
#> [13] "0.39_102.1549m/z" "0.40_213.0184m/z" "0.41_145.0360m/z"
#> [16] "0.43_102.1549m/z" "0.45_345.2033m/z" "0.48_102.1548m/z"
#> [19] "0.53_102.1550m/z" "0.57_153.7767m/z" "0.57_201.7990m/z"
#> [22] "0.58_177.4624m/z" "0.59_144.4417m/z" "0.59_203.2244m/z"
#> [25] "0.59_250.8096m/z" "0.60_129.1388m/z" "0.60_146.1659m/z"
#> [28] "0.60_182.7792m/z" "0.60_220.1359m/z" "0.60_308.2184m/z"
#> [31] "0.60_397.4459m/z" "0.61_138.1082m/z" "0.61_173.6264m/z"
#> [34] "0.61_188.6307m/z" "0.61_195.1269m/z" "0.61_209.1333m/z"
#> [37] "0.61_210.1284m/z" "0.61_300.1594m/z" "0.62_129.5943m/z"
#> [40] "0.62_141.5828n" "0.62_143.0402m/z" "0.62_152.1111m/z"
#> [43] "0.62_178.1125m/z" "0.62_186.1072m/z" "0.62_187.6351m/z"
#> [46] "0.62_193.1150m/z" "0.62_202.6318m/z" "0.62_206.6184m/z"
#> [49] "0.62_207.1142m/z" "0.62_216.1416m/z" "0.62_216.6342m/z"
#> [52] "0.62_223.1375m/z" "0.62_224.1328m/z" "0.62_230.1451m/z"
#> [55] "0.62_237.6514m/z" "0.62_242.6282m/z" "0.62_247.8076m/z"
#> [58] "0.62_249.6332m/z" "0.62_252.6512m/z" "0.62_259.1503m/z"
#> [61] "0.62_267.1498m/z" "0.62_273.6578m/z" "0.62_281.1568m/z"
#> [64] "0.62_295.1650m/z" "0.62_312.1757m/z" "0.63_102.1551m/z"
#> [67] "0.63_119.4797m/z" "0.63_133.4769m/z" "0.63_153.9915m/z"
#> [70] "0.63_156.5933m/z" "0.63_166.1144m/z" "0.63_174.5051m/z"
#> [73] "0.63_192.1185m/z" "0.63_200.6172m/z" "0.63_214.9373m/z"
#> [76] "0.63_221.1212m/z" "0.63_228.6148m/z" "0.63_235.6279m/z"
#> [79] "0.63_245.1423m/z" "0.63_265.4417m/z" "0.63_268.1517m/z"
#> [82] "0.63_269.1494m/z" "0.63_270.1403n" "0.63_280.1518m/z"
#> [85] "0.63_283.9593m/z" "0.63_311.9542m/z" "0.64_102.0344m/z"
#> [88] "0.64_110.9997n" "0.64_151.0360m/z" "0.64_161.5051m/z"
#> [91] "0.64_177.7884m/z" "0.64_185.4569m/z" "0.64_190.4682m/z"
#> [94] "0.64_198.4570m/z" "0.64_200.1062m/z" "0.64_201.6398m/z"
#> [97] "0.64_209.6270m/z" "0.64_257.9271m/z" "0.64_259.6555m/z"
#> [100] "0.64_270.9272m/z"
If you set filter
to a filter name that you did not
apply to your data, an error message will be returned.
filter_summary(data_filtered, filter = "insource")
#> Error in `mpactr_object$get_log()`:
#> ! `filter` insource has not yet been applied to the data. Run the
#> corresponding filter function prior to extracting the summary.
If you want to retrieve the filter summary for the group filter, you
must also supply the group name with the group
argument:
group
must always be supplied for this filter, even if
only one group filter has been applied.
You can view the filtering status of all input ions with the
qc_summary()
function. A data.table reporting the compound
id (compounds
) and if it failed or passed filtering. If the
ion failed filtering, its status will report the name of the filter it
failed.
head(qc_summary(data_filtered)[order(compounds), ])
#> status compounds
#> <char> <char>
#> 1: group-Blanks 0.03_102.1549m/z
#> 2: group-Blanks 0.04_113.9641m/z
#> 3: group-Blanks 0.06_182.9860m/z
#> 4: group-Blanks 0.06_235.9635m/z
#> 5: replicability 0.09_68.9936m/z
#> 6: group-Blanks 0.10_102.1549m/z
You can visualize filtering results with a tree map using the
filtering summary obtained from qc_summary()
and the
packages ggplot2
and treemapify
.
First, we need to determine the number of ions for each ion status in
the summary table. You can report the count; however, we need to
calculate the percent of ions in each group. We have done this in the
code chunk below, where we have used data.table
syntax as
the qc_summary()
returns a data.table
object.
If you are not familiar with the package data.table, check out their
resources on gitlab.
library(ggplot2)
library(treemapify)
ion_counts <- qc_summary(data_filtered)[, .(count = .N), by = status][
, percent := (count / sum(count) * 100)
]
Finally, we plot the treemap:
tm <- ggplot(ion_counts) +
aes(area = percent, fill = status) +
geom_treemap() +
geom_treemap_text(aes(
label = paste(status, count, paste0("(", round(percent, 2), "%)"),
sep = "\n"),
fontface = c("bold")
))
tm
This plot can be customized with ggplot2, for example we only want to display the percentage:
tm <- ggplot(ion_counts) +
aes(area = percent, fill = status) +
geom_treemap() +
geom_treemap_text(aes(
label = paste(status, paste0("(", round(percent, 2), "%)"), sep = "\n"),
fontface = c("bold")
))
tm
Or you no longer need the legend and maybe we want custom colors:
If you want a fast visualization of the treemap, you can pass the
mpactr object to the function plot_qc_tree()
.