using CSV, Plots, Images #, TimeSeries
using StatsBase
using Interact
using Knet
plotlyjs()
include("findpeaks.jl")
include("pan_tompkins.jl")
include("run_pan_tompkins.jl")
include("get_breaks.jl")
So there are two main predicaments with taking the beats we have separated and converting them into a full dataset that one can do deep learning on:
1) The beats are different lengths, and that actually matters. The length of a beat can be indicative of whether the person has an arrhythmia, and if so, what type.
2) The annotations were done by hand and they include non-beat annotations, so they have to be correctly matched up to the identified peaks to infer the truth labels.
# Test on an arrhythmia sample (record 207, sampled at 360 Hz)
fs_arrhyth = 360;
sample_arrhyth = CSV.read("207.csv", header = ["Time", "ECG1", "ECG2"], datarow = 1, nullable = false)
fs = fs_arrhyth;
x = sample_arrhyth[:,3];   # second ECG channel
t = sample_arrhyth[:,1];   # time stamps
x_HP, x_MAF, qrs_i, qrs_c, SIGL_buf, NOISL_buf, THRS_buf, qrs_i_raw, qrs_amp_raw, SIGL_buf1, NOISL_buf1, THRS_buf1 = run_pan_tompkins(x, t, fs);
breaks = get_breaks(t, qrs_i_raw);   # break points between the detected beats
Here's what the annotations look like. There are two files: one with the timestamps and the other with the actual annotation codes. Each character represents a different type of beat or non-beat phenomenon. Our goal is to match each annotation timestamp to the closest identified peak.
# Timestamps of the annotations
sample_ann = CSV.read("207_ann.csv", header = false, datarow = 1, nullable = false)
# Annotation codes for the beats
sample_labels = CSV.read("207_anntype.csv", header = false, datarow = 1, nullable = false)
# Assign each detected peak the label of the nearest hand annotation
truth_labels = Vector{String}(length(qrs_i_raw))
for i = 1:length(qrs_i_raw)
    truth_labels[i] = sample_labels[indmin(abs.(qrs_i_raw[i] .- sample_ann[:,1])), 1]
end
I'm skipping a lot of the boring details of converting this into a dataset, but basically it involves:
1) Creating a window of 1 second on either side of the peak and fitting the identified beat into it. If the beat extends less than 1 second on either side of the peak, it is zero-padded so the peak stays centered in the window (see the sketch after this list).
2) If either or both sides of the beat extend more than 1 second from the peak, they are trimmed to 1 second.
This strategy forces all the beat vectors to be the same length while still capturing the actual underlying length of each beat. I also do some curation of the data categories and some data augmentation to try to even out the class sizes.
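The windowing step is roughly the following. This is only a sketch of that idea, not the actual get_dataset.jl implementation; window_beat, beat_start, and beat_stop are hypothetical names.
# Sketch: fit one beat into a fixed-length, zero-padded window centered on its peak.
# Assumes beat_start <= peak <= beat_stop are sample indices into x, and that the
# window stays inside the bounds of x.
function window_beat(x, peak, beat_start, beat_stop, fs)
    win    = zeros(2*fs + 1)             # one second on each side of the peak
    center = fs + 1                      # the peak sample sits at the center
    left   = min(peak - beat_start, fs)  # samples kept to the left (trimmed at 1 s)
    right  = min(beat_stop - peak, fs)   # samples kept to the right (trimmed at 1 s)
    win[center-left : center+right] = x[peak-left : peak+right]
    return win
end
With fs = 360 this sketch gives 721-sample vectors (one second on each side plus the peak sample), so every beat ends up the same length, while the extent of the zero padding still reflects how long the beat actually was.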
include("get_dataset.jl")
dataset1,truth_labels1 = get_dataset(207,360)
For example, this single subject yields 2500+ beats.
size(dataset1), size(truth_labels1)
Each subject's data contains only a subset of the labels, depending on which heart condition they have.
unique(truth_labels1)
Again, I'm skipping some boring stuff, but I combine the datasets from several subjects and convert the truth labels into integers.
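The label-to-integer conversion is essentially an index lookup into a key of unique labels. Here is a rough sketch of just that step (illustrative only; the real get_comb_dataset.jl also concatenates the per-subject datasets and does the class curation):
# Sketch: map string truth labels to integer categories via a label key
label_key  = unique(truth_labels)               # one entry per distinct annotation code
truth_cats = indexin(truth_labels, label_key)   # each beat's label becomes its index into label_key
Keeping label_key around lets you map the integers back to annotation codes, which is how the countmap(label_key[abn_truth_cats]) call further down works.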
include("get_comb_dataset.jl")
abn_dataset, full_dataset, abn_truth_cats, full_bin_cats, label_key = get_comb_dataset([207, 212], 360)
size(abn_dataset), size(full_dataset), label_key
This is the total dataset I ended up using. It is from 37 subjects, and has 19000+ normal beats and 24000+ abnormal beats. There are two classification tasks I am interested in:
1) Normal vs abnormal (binary classification)
2) Within abnormal, different classes of abnormal beats (multi-class classification)
abn_dataset, full_dataset, abn_truth_cats, full_bin_cats, label_key = get_comb_dataset([207, 212, 203, 209, 201, 202, 205, 208, 210, 213, 220, 221, 222, 230, 111, 112, 113, 114, 115, 116, 117, 118, 119, 121, 122, 123, 124, 100, 101, 103, 104, 106, 108, 109, 232, 233, 234],360)
size(abn_dataset), size(full_dataset), size(abn_truth_cats), size(full_bin_cats), label_key
This shows the breakdown of the abnormal beat types only. It's not perfectly even, but the largest class is roughly one-sixth of the total, so the skew isn't severe.
countmap(label_key[abn_truth_cats])
Converted to integers...
countmap(abn_truth_cats)
Breakdown of normal vs abnormal beats... this is about a 45/55 split, so not perfectly even, but not bad.
countmap(full_bin_cats)
This function just splits the datasets into training and test sets, given the proportion of data to hold out for testing.
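A minimal sketch of such a split for a single dataset (hypothetical names, assuming beats are stored as rows; not the actual contents of get_traintest.jl):
# Sketch: randomly hold out a fraction of the beats as the test set.
# (On Julia >= 0.7, randperm requires `using Random`.)
function split_traintest(data, labels, test_frac)
    n    = size(data, 1)                 # number of beats (rows)
    perm = randperm(n)                   # shuffled beat indices
    ntst = round(Int, test_frac * n)     # size of the test set
    tst, trn = perm[1:ntst], perm[ntst+1:end]
    return data[tst, :], labels[tst], data[trn, :], labels[trn]
end
Returning the test split first mirrors the ordering used in the call below.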
include("get_traintest.jl")
xtst_mc, ytst_mc, xtrn_mc, ytrn_mc, xtst_bin, ytst_bin, xtrn_bin, ytrn_bin = get_traintest(abn_dataset,full_dataset,abn_truth_cats,full_bin_cats,0.1);
For multi-class classification, here is the dataset: 2400+ in test set and almost 22000 in training set.
size(xtst_mc), size(ytst_mc), size(xtrn_mc), size(ytrn_mc)
For binary classification, here is the dataset: 4300+ in test set and 39000+ in training set.
size(xtst_bin), size(ytst_bin), size(xtrn_bin), size(ytrn_bin)