MSI Data Processing Pipeline

Version: 2.0

Updated the imzML-to-CSV conversion procedure

Added a Python method for preprocessing imzML files

1. Mass Spectrometry Imaging

MSI (mass spectrometry imaging) combines mass spectrometric analysis with spatial resolution: it simultaneously captures the chemical composition of the molecules in a sample and their spatial distribution, enabling a "visualized" analysis of the molecules on a sample surface.
MSI has become a hot technique in spatial metabolomics, and this post walks through a brief pipeline for processing MSI data.

2. MSI Data

What are the common data types?

In the MSI field, you will most likely encounter two file formats:
● Raw vendor formats
● The imzML open data format
In practice, the raw MS data produced by different vendors' instruments differ from one another; the formats used by the major vendors are summarized in the table below.

What information do Raw and imzML contain?

RAW files retain more of the original information: instrument status logs, diagnostics, raw calibration data, and so on.
mzML focuses on the core data:

In an mzML file, the mzML tags mostly describe parameters and procedures that are not the spectral data themselves; the spectrum tags under run hold the actual spectra, of which the m/z array and intensity array are the most important pieces.
imzML additionally stores x, y, z coordinates in its metadata and comes with a companion .ibd binary data file. Viewed spectrum-wise, each row is the complete mass spectrum at one spatial position; viewed spatially, each column is the intensity of a fixed m/z value across all positions, which is what we use to generate ion images.
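As a toy illustration of the column view (the coordinates and intensities below are invented, not read from a real file), building one ion image from per-pixel intensities of a fixed m/z takes only a few lines of numpy:

```python
import numpy as np

# Hypothetical pixels: (x, y) coordinates and the intensity of one fixed
# m/z value at each pixel (values invented for illustration)
coords = [(1, 1), (2, 1), (1, 2)]
intensities = [10.0, 30.0, 20.0]

# Place each intensity at (row = y, col = x) to form the ion image
ion_image = np.zeros((4, 4))
for (x, y), value in zip(coords, intensities):
    ion_image[y, x] = value

print(ion_image[1, 1], ion_image[1, 2])  # 10.0 30.0
```

Plotting such a matrix as a heatmap is exactly what Cardinal's image function does for one m/z.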

How do I convert Raw data to imzML?

In principle, once we know the number of scan spots, the spatial coordinates needed for imaging can be computed directly from the Raw file.
This post focuses on preprocessing imzML files. For Raw files, you can use the software linked below to perform the Raw-to-imzML conversion; try practicing with the example files it provides.
https://www.ms-imaging.org/imzml/software-tools/raw-to-imzml-converter/
I will add screenshots of this step in a later update.
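For intuition, the coordinate reconstruction can be sketched as below, assuming a simple row-major raster scan (a simplifying assumption; real instruments may use other scan patterns, e.g. serpentine):

```python
def index_to_xy(i, width):
    """Map a linear spectrum index to (x, y) under a row-major raster scan."""
    return i % width, i // width

# A raster 4 spots wide: spectra 0..7 fill the first two rows
print([index_to_xy(i, width=4) for i in range(8)])
# → [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```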

MALDI? DESI?

The data we encounter typically come from one of two ionization techniques: MALDI (matrix-assisted laser desorption/ionization) or DESI (desorption electrospray ionization). MALDI offers high spatial resolution but requires an elaborate sample-preparation workflow and ionization under vacuum. DESI ionizes directly at atmospheric pressure, which simplifies sample preparation, at the cost of lower spatial resolution than MALDI.

3. imzML Preprocessing with Cardinal

Installing the Cardinal package

There are many ways to process imzML files; once you are familiar with the workflow, you can write your own scripts.
This post uses the Cardinal package for preprocessing. Cardinal supports both MALDI and DESI MSI workflows and provides statistical analysis of mass-spectrometry experiments on biological samples.
Before starting, install RStudio or another IDE yourself. Then install the Cardinal package with:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Cardinal")

Don't forget to load the newly installed package:

# Load the package
library(Cardinal)

If no error has appeared so far, the installation succeeded; you can browse the official documentation to learn more:

browseVignettes("Cardinal")


Reading an imzML file

# Read continuous-format data
path_continuous <- file.path("C:", "Users", "your_file_path", fsep = "\\")
mse <- readMSIData(path_continuous)

We can pull some of the data out for a quick look:

mz(mse)[1:10]                  # first 10 m/z values
run(mse)[1:10]                 # first 10 run (batch) labels
featureData(mse)[1:10]         # first 10 features
pixelData(mse)[1:10]           # first 10 pixels
plot(mse, pixel = c(211))      # plot one pixel's spectrum: intensity (y) vs. m/z (x)
plot(mse, pixel = c(211, 611)) # plot several pixels' spectra
image(mse, mz = 1200)          # spatial image of one m/z; 1200 may not exist in your data, so pick an m/z that does (the RStudio message after running will help)

Once you have a rough feel for your imzML data, you can start preprocessing.

Preprocessing workflow

Gaussian smoothing (denoising)

Gaussian smoothing reduces random noise and improves the signal-to-noise ratio.

mse_smoothed <- smoothSignal(mse,
                             method = "gaussian",
                             window = 5,
                             sd = 2,
                             units = "ppm")

A moderately sized window removes noise effectively while preserving peak shape.
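To see what this step does independently of Cardinal, here is a small Python sketch using scipy's gaussian_filter1d on a synthetic spectrum (the peak position, width, and noise level are invented for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
mz = np.linspace(100, 110, 1000)
clean = np.exp(-((mz - 105.0) ** 2) / 0.01)   # one Gaussian peak
noisy = clean + rng.normal(0, 0.05, mz.size)  # add random noise

# Convolve the intensities with a Gaussian kernel (sigma in data points)
smoothed = gaussian_filter1d(noisy, sigma=2)

# Smoothing shrinks the residual noise while keeping the peak location
print(mz[np.argmax(smoothed)])
```

If sigma grows much larger than the peak width, the peak itself starts to flatten out, which is the trade-off behind choosing a "moderately sized" window.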

Spectral alignment

Acquisition conditions are never perfectly identical across a run, so some mass drift is inevitable. An initial alignment corrects the mass shifts between different spatial positions.

mse_aligned <- peakAlign(mse_smoothed,
                         tolerance = 1500,
                         units = "ppm",
                         BPPARAM = MulticoreParam())  # MulticoreParam() comes from BiocParallel

Increasing the tolerance raises the match rate but may introduce false matches.
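The tolerance trade-off can be illustrated with a tiny ppm-matching helper (hypothetical m/z values; this is not Cardinal's actual alignment algorithm):

```python
def match_ppm(observed_mz, reference_mz, tolerance_ppm):
    """Return True if observed_mz falls within tolerance_ppm of reference_mz."""
    return abs(observed_mz - reference_mz) / reference_mz * 1e6 <= tolerance_ppm

# A ~1.5 ppm instrument drift matches under a 1500 ppm tolerance...
print(match_ppm(885.5480, 885.5493, tolerance_ppm=1500))  # True
# ...but a genuinely different peak ~1130 ppm away would wrongly match too
print(match_ppm(886.5493, 885.5493, tolerance_ppm=1500))  # True
```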

TIC normalization

TIC (total ion current) normalization corrects for overall intensity differences between pixels.

mse_normalized <- normalize(mse_aligned,
                            method = "tic",
                            scale = TRUE)
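Conceptually, TIC normalization divides each pixel's spectrum by its total ion current, so pixels that differ only by an overall intensity factor become identical. A minimal numpy sketch with invented numbers:

```python
import numpy as np

# Two pixels whose spectra differ only by an overall factor of 2
spectra = np.array([[1.0, 2.0, 3.0],
                    [2.0, 4.0, 6.0]])

# Divide each spectrum (row) by its total ion current
tic = spectra.sum(axis=1, keepdims=True)
normalized = spectra / tic

print(normalized[0])  # both rows become [1/6, 1/3, 1/2]
```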

Peak picking

Typically we keep only the significant mass peaks, reducing the data's dimensionality.

mse_peaks <- peakPick(mse_normalized, method = "filter", SNR = 5)

Increasing SNR reduces false positives but may lose low-abundance peaks.
Adjusting peakWidth adapts the detection to different peak shapes.

Peak binning

Merge nearby peaks to reduce redundancy.

mse_done <- bin(mse_peaks, spectra = "intensity", index = "mz",
                method = "linear", resolution = 5, units = "ppm")

Changing the resolution changes how aggressively peaks are merged, which may drop some peaks; tune it to your data.
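The merging idea can be sketched as a greedy left-to-right grouping of peaks closer than the resolution, combining each group into one intensity-weighted peak (a simplification of what bin does; the peak values are invented):

```python
import numpy as np

def merge_peaks(mz, intensity, tol_ppm=5):
    """Group peaks within tol_ppm of the group's first peak, then
    combine each group into one intensity-weighted peak."""
    groups = [[0]]
    for i in range(1, len(mz)):
        start = mz[groups[-1][0]]
        if (mz[i] - start) / start * 1e6 <= tol_ppm:
            groups[-1].append(i)
        else:
            groups.append([i])
    merged_mz = np.array([np.average(mz[g], weights=intensity[g]) for g in groups])
    merged_int = np.array([intensity[g].sum() for g in groups])
    return merged_mz, merged_int

mz = np.array([100.0000, 100.0003, 100.0010, 200.0])
intensity = np.array([10.0, 5.0, 2.0, 8.0])
merged_mz, merged_int = merge_peaks(mz, intensity)
print(len(merged_mz))  # the first two peaks (3 ppm apart) merge: 3 peaks remain
```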

Inspecting the result

You can now use the image function to compare the images before and after preprocessing.

Output

# Save the result
imzfile <- tempfile(fileext = "your_filename.imzML")
writeMSIData(mse_done, imzfile)
list.files(imzfile)

4. Converting imzML to Processable Files

We have saved the preprocessed imzML file; for the downstream training and analysis, it needs to be converted into other processable formats.
Here I used Python code from my senior labmate Liu Siyang (刘思扬) to do this.

Loading the required libraries

import os
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.ndimage import gaussian_filter1d
from pyimzml.ImzMLParser import ImzMLParser
from pyimzml.ImzMLWriter import ImzMLWriter
import csv

Setting paths and reading the imzML file

First, read the preprocessed imzML file and set up the output paths.

# Define file paths
input_path = '/your/path/xx.imzML'
output_path = '/your/save/path/xx.imzML'
output_dir = "/your/save/path"
sorted_peaklist_file = '/your/peaklist/path/xx.txt'

# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Process the imzML file
parser = ImzMLParser(input_path)
coordinates = []
mz_values_list = []
intensity_values_list = []

Extracting the peaks and saving them to CSV files

def extract_imzml_peaks(imzml_file, output_dir):
    parser = ImzMLParser(imzml_file)
    for i, (x, y, z) in enumerate(parser.coordinates, start=1):
        mzs, intensities = parser.getspectrum(i - 1)
        output_filename = os.path.join(output_dir, f"peaks_{i:05d}.csv")

        with open(output_filename, 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(['mz', 'intensity'])
            for mz, intensity in zip(mzs, intensities):
                csvwriter.writerow([mz, intensity])

extract_imzml_peaks(output_path, output_dir)

Saving the coordinates to a CSV file

# Save coordinates to CSV
imzml = ImzMLParser(output_path)
coordinates = np.array(imzml.coordinates)[:, :2]
np.savetxt('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/cood-gz4_5.csv', coordinates, delimiter=',')

Stripping all-zero rows and columns using the coordinates

# Read sparse data and coordinates
data0 = sparse.load_npz('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5.npz').toarray()
data = data0[1:, :]
coordinates = np.loadtxt('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/cood-gz4_5.csv', delimiter=',')

# Place each pixel's spectrum at its (y, x) grid position
x, y = coordinates[:, 0].astype(int), coordinates[:, 1].astype(int)
output_mat = np.zeros((np.max(y) + 1, np.max(x) + 1, data.shape[1]))
for i in range(len(data)):
    output_mat[y[i], x[i]] = data[i]

# Drop all-zero rows, then all-zero columns, and save as a sparse matrix
output_mat = np.array([list(output_mat[i]) for i in range(len(output_mat)) if np.sum(output_mat[i]) != 0])
output_mat = np.array([list(output_mat[:, i]) for i in range(len(output_mat[0])) if np.sum(output_mat[:, i]) != 0])
data = sparse.bsr_matrix(output_mat.reshape(output_mat.shape[0] * output_mat.shape[1], output_mat.shape[2]))
sparse.save_npz('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5_%d_%d.npz' % (output_mat.shape[0], output_mat.shape[1]), data)

print("Data has been saved.")

With that, the processable files are saved. In the output folder you will find a series of peaks_*.csv files, the coordinate file, and the peak list.

5. The Full Preprocessing Pipeline in Python

The code in this part was provided by our group and covers the entire workflow above. Swap in your own paths throughout; the individual steps are not explained again.

import os
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.ndimage import gaussian_filter1d
from pyimzml.ImzMLParser import ImzMLParser
from pyimzml.ImzMLWriter import ImzMLWriter
import csv

# Define file paths
input_path = '/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/slice_1.imzML'
output_path = '/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5.imzML'
output_dir = "/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/export-gz4_5/"
sorted_peaklist_file = '/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/peaklist-neg-gz4_5-sort.txt'

# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)


class SpectrumProcessor:
    def __init__(self, mz, intensity):
        self.mz = mz
        self.intensity = intensity

    @staticmethod
    def peak_picking(mz_array, intensity_array, snr=1, sigma=2.0):
        """
        Identify peaks in the intensity array based on a signal-to-noise ratio (SNR) threshold.

        :param mz_array: Array of m/z values.
        :param intensity_array: Array of intensity values corresponding to the m/z values.
        :param snr: Signal-to-noise ratio threshold for peak detection.
        :param sigma: Standard deviation for Gaussian kernel used in smoothing.
        :return: Arrays of m/z values and intensities for detected peaks.
        """
        # Smooth the intensity array to reduce noise
        smoothed_intensity = gaussian_filter1d(intensity_array, sigma=sigma)

        # Calculate the mean and standard deviation of the smoothed intensity array
        mean_intensity = np.mean(smoothed_intensity)
        std_intensity = np.std(smoothed_intensity)

        # Identify peaks where intensity is greater than mean + snr * std
        peaks = smoothed_intensity > (mean_intensity + snr * std_intensity)

        # Return the m/z values and intensities of the detected peaks
        return mz_array[peaks], intensity_array[peaks]

    def binning(self, Da):
        """
        Binning algorithm for spectra.
        :return: 1D numpy array, binned spectrum.
        """
        bin_size = int(Da / (self.mz[1] - self.mz[0]))
        bin_spectrum = np.mean(
            np.pad(self.intensity, (0, bin_size - len(self.intensity) % bin_size), mode='constant', constant_values=0)
            .reshape(-1, bin_size),
            axis=1)

        mz_values1 = self.mz[::bin_size]
        return mz_values1, bin_spectrum


# Process the imzML file
parser = ImzMLParser(input_path)
coordinates = []
mz_values_list = []
intensity_values_list = []

for idx, (x, y, z) in enumerate(parser.coordinates):
    mz_array, intensity_array = parser.getspectrum(idx)
    mz_peaks, intensity_peaks = SpectrumProcessor.peak_picking(mz_array, intensity_array)
    coordinates.append((x, y, z))
    print(idx)
    mz_values_list.append(mz_peaks)
    intensity_values_list.append(intensity_peaks)

# Write processed data to a new imzML file
with ImzMLWriter(output_path) as writer:
    for idx, (mz_peaks, intensity_peaks) in enumerate(zip(mz_values_list, intensity_values_list)):
        x, y, z = coordinates[idx]
        print(idx)
        writer.addSpectrum(mz_peaks, intensity_peaks, (x, y, z))


# Extract peaks to CSV files
def extract_imzml_peaks(imzml_file, output_dir):
    parser = ImzMLParser(imzml_file)
    for i, (x, y, z) in enumerate(parser.coordinates, start=1):
        mzs, intensities = parser.getspectrum(i - 1)
        output_filename = os.path.join(output_dir, f"peaks_{i:05d}.csv")

        with open(output_filename, 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(['mz', 'intensity'])
            for mz, intensity in zip(mzs, intensities):
                csvwriter.writerow([mz, intensity])


extract_imzml_peaks(output_path, output_dir)

# Save coordinates to CSV
imzml = ImzMLParser(output_path)
coordinates = np.array(imzml.coordinates)[:, :2]
np.savetxt('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/cood-gz4_5.csv', coordinates, delimiter=',')


def peak_filter(PATH, min_frequency=0.005, type='MATLAB', file=' '):
    if type == 'MATLAB':
        path = PATH

        m = len(os.listdir(path))
        mz = []
        for i in range(1, m + 1):
            data = pd.read_csv(path + str(i), header=None)
            data = data.values
            mz.extend(list(data[:, 0]))
        mzs = sorted(set(np.around(mz, 4)))
        print('all mz number is ', len(mzs))
        dict_total = {}
        for i in mzs:
            dict_total[i] = 0
        for i in range(m):
            data = pd.read_csv(path + str(i + 1), header=None)
            data = data.values
            _u = np.around(data[:, 0], 4)  # round to 4 decimal places
            for j in _u:
                dict_total[j] += 1

        i = 0
        peak_list = []
        for key, value in dict_total.items():
            if value > m * min_frequency:
                i += 1
                peak_list.append(key)
        print('filter peak is ', len(peak_list))
        np.savetxt('peak_list3', peak_list)

        output_mat = np.zeros((m, len(peak_list)))

        for i in range(m):
            print('processing ', i, 'pixel')
            data = pd.read_csv(path + str(i + 1), header=None)
            data = data.values
            _u = np.around(data[:, 0], 4)  # round to 4 decimal places
            for j in range(len(_u)):
                if _u[j] in peak_list:
                    output_mat[i, peak_list.index(_u[j])] = data[j, 1]
        return output_mat

    if type == 'R':
        path = PATH

        m = len(os.listdir(path))
        mz = []
        for i in range(1, m):
            if i % 100 == 0:
                print(i)
            k = '{:0>5d}'.format(i)
            data = pd.read_csv(path + file + k + '.csv', header=None)
            data = data.values[1:]
            mz.extend(list(data[:, 0]))
        mz = [float(x) for x in mz]
        mzs = list(set(mz))
        print('all mz number is ', len(mzs))
        dict_total = {}
        for i in mzs:
            dict_total[i] = 0
        for i in range(1, m):
            if i % 100 == 0:
                print(i)
            k = '{:0>5d}'.format(i)
            data = pd.read_csv(path + file + k + '.csv', header=None)
            data = data.values[1:]
            dataint = [float(x) for x in data[:, 0]]
            dataint2 = [float(x) for x in data[:, 1]]
            _u = dataint
            threshold = np.percentile(dataint2, 95) * 0.001
            for j in range(len(_u)):
                if dataint2[j] >= threshold:
                    dict_total[_u[j]] += 1

        i = 0
        peak_list = []
        for key, value in dict_total.items():
            if value > m * min_frequency:
                i += 1
                peak_list.append(key)
        print('after filter peak is ', len(peak_list))

        output_mat = np.zeros((m, len(peak_list)))

        for i in range(1, m):
            k = '{:0>5d}'.format(i)
            if i % 100 == 0:
                print(i)
            data = pd.read_csv(path + file + k + '.csv', header=None)
            data = data.values[1:]
            dataint = [float(x) for x in data[:, 0]]
            _u = dataint
            for j in range(len(_u)):
                if _u[j] in peak_list:
                    output_mat[i, peak_list.index(_u[j])] = data[j, 1]
        return output_mat, peak_list


# Filter peaks and save
output_mat, peak_list = peak_filter(output_dir, min_frequency=0.05, type='R', file='peaks_')
spa_data = sparse.bsr_matrix(output_mat)
sparse.save_npz('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5.npz', spa_data)
np.savetxt('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/peaklist-gz4_5', peak_list)

# Read sparse data and coordinates
data0 = sparse.load_npz('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5.npz').toarray()
data = data0[1:, :]
coordinates = np.loadtxt('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/cood-gz4_5.csv', delimiter=',')

x, y = coordinates[:, 0].astype(int), coordinates[:, 1].astype(int)
output_mat = np.zeros((np.max(y) + 1, np.max(x) + 1, data.shape[1]))
for i in range(len(data)):
    output_mat[y[i], x[i]] = data[i]

# Drop all-zero rows, then all-zero columns
output_mat = np.array([list(output_mat[i]) for i in range(len(output_mat)) if np.sum(output_mat[i]) != 0])
output_mat = np.array([list(output_mat[:, i]) for i in range(len(output_mat[0])) if np.sum(output_mat[:, i]) != 0])
data = sparse.bsr_matrix(output_mat.reshape(output_mat.shape[0] * output_mat.shape[1], output_mat.shape[2]))
sparse.save_npz('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5_%d_%d.npz' % (output_mat.shape[0], output_mat.shape[1]), data)

# Sort and save the peak list
with open('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/peaklist-gz4_5', 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
lines.sort()
with open(sorted_peaklist_file, 'w') as file:
    for line in lines:
        file.write(f"{line}\n")
print("peaklist_saved")

# Reorder data columns based on the sorted peak list
data0 = sparse.load_npz(
    '/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5_%d_%d.npz' % (output_mat.shape[0], output_mat.shape[1])).toarray()
with open('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/peaklist-gz4_5', 'r') as f:
    reference = [float(x.strip()) for x in f.readlines()]
with open(sorted_peaklist_file, 'r') as f:
    target = [float(x.strip()) for x in f.readlines()]

indices = [target.index(value) if value in target else -1 for value in reference]
data = np.vstack([data0[:, idx] for idx in indices if idx != -1]).T
np.save('/Users/ForRiver/OneDrive/Desktop/Pre/BIONET/preprocess/compare-neg/gz4_5', data)

print("Data has been saved.")

Closing remarks

This article mainly serves my own study of MSI preprocessing steps and as an introduction to the imzML preprocessing workflow for new BIONET members. I have processed only a small amount of data so far, so when testing, try several parameter settings yourself and compare the results (though the differences are not always dramatic). If you spot any mistakes, corrections are welcome.


MSI Data Processing Pipeline
http://pleinelune-r.github.io/2025/08/05/MSI数据处理流程/
Author: Pleinelune
Published: August 5, 2025