Matching the Beat Annotations

In [1]:
using CSV, Plots, Images #, TimeSeries
using StatsBase
using Interact
using Knet
plotlyjs()

Out[1]:
Plots.PlotlyJSBackend()
In [2]:
include("findpeaks.jl")
include("pan_tompkins.jl")
include("run_pan_tompkins.jl")
include("get_breaks.jl")
Out[2]:
get_breaks (generic function with 1 method)

There are two main challenges in taking the beats we have separated and converting them into a full dataset that we can do deep learning on:

1) The beats are different lengths, and that actually matters. The length of a beat can be indicative of whether the person has an arrhythmia, and if so, what type.

2) The annotations were done by hand, and they include non-beat annotations, so these annotations have to be correctly matched up to the identified peaks to infer truth labels.

In [3]:
#Test on arrhythmia sample
fs_arrhyth = 360;
sample_arrhyth = CSV.read("207.csv",header = ["Time", "ECG1", "ECG2"], datarow=1, nullable=false)
fs = fs_arrhyth;
x = sample_arrhyth[:,3];
t = sample_arrhyth[:,1];

x_HP, x_MAF, qrs_i, qrs_c, SIGL_buf, NOISL_buf, THRS_buf, qrs_i_raw, qrs_amp_raw, SIGL_buf1, NOISL_buf1, THRS_buf1 = run_pan_tompkins(x,t,fs);
breaks = get_breaks(t,qrs_i_raw);

Here's what the annotations look like. There are two files: one with the timestamps and one with the actual annotation codes. Each character represents a different type of beat or non-beat phenomenon. Our goal is to match each annotation timestamp to the closest identified peak.

In [4]:
#Timestamps of the annotations
sample_ann = CSV.read("207_ann.csv",header = false,datarow = 1,nullable=false)
Out[4]:
Row    Column1
1      10
2      51
3      313
4      836
5      1079
6      1589
7      1822
8      2343
9      2594
10     3112
11     3358
12     3858
13     4093
14     4638
15     4890
16     5410
17     5649
18     6161
19     6409
20     6940
21     7192
22     7701
23     7938
24     8438
25     8695
26     9216
27     9464
28     9950
29     10197
30     10690
In [5]:
#Annotations of the beats
sample_labels = CSV.read("207_anntype.csv",header = false,datarow = 1,nullable=false)
Out[5]:
Row    Column1
1      +
2      R
3      V
4      R
5      V
6      R
7      V
8      R
9      V
10     R
11     V
12     R
13     V
14     R
15     V
16     R
17     V
18     R
19     V
20     R
21     V
22     R
23     V
24     R
25     V
26     R
27     V
28     R
29     V
30     R
In [6]:
#Match each detected peak to the nearest annotation timestamp and take its label
truth_labels = Vector{String}(size(qrs_i_raw))
for i = 1:length(qrs_i_raw)
    truth_labels[i] = sample_labels[indmin(abs.(qrs_i_raw[i] .- sample_ann[:,1])),1]
end

Converting to a Dataset

I'm skipping a lot of the boring detail of converting this into a dataset, but basically it involves:

1) Creating a window of 1 second on either side of the peak and fitting the identified beat into it. If the beat is shorter than 1 second on either side, it is centered in the window and padded with zeros on both sides.

2) If either or both sides of the beat are longer than 1 second, they are snipped to 1 second.

This strategy forces all the vectors that contain beats to be the same length, but still captures the actual underlying length of the beat (see the sketch below). I also include some curation of the data categories and some data augmentation to try to even out the sizes of the classes.
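
As a rough illustration of that windowing step, here is a minimal sketch. This is not the actual get_dataset.jl code; window_beat is a hypothetical helper, and it assumes the detected peak is placed at the center of the window.

#Hypothetical sketch of the windowing step; the real logic lives in get_dataset.jl
#Each beat is placed in a fixed window of 2*fs + 1 samples (721 at fs = 360),
#zero-padded where it is shorter than 1 second and clipped where it is longer
function window_beat(x, peak, beat_start, beat_stop, fs)
    halfwin = fs                              # one second on each side of the peak
    win = zeros(2*halfwin + 1)
    lo = max(beat_start, peak - halfwin)      # snip anything more than 1 s before the peak
    hi = min(beat_stop,  peak + halfwin)      # snip anything more than 1 s after the peak
    win[(lo - peak + halfwin + 1):(hi - peak + halfwin + 1)] = x[lo:hi]
    return win
end

The 721-sample window length matches the row dimension of the dataset below.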

In [7]:
include("get_dataset.jl")
Out[7]:
get_dataset (generic function with 1 method)
In [8]:
dataset1,truth_labels1 = get_dataset(207,360)
Out[8]:
([0.0 0.0 … 0.0846156 0.0228839; 0.0 0.0 … 0.0724476 0.0882155; … ; 0.0 0.0 … 0.101718 0.0711643; 0.0 0.0 … 0.0537974 0.0811386], String["R", "V", "R", "V", "R", "V", "R", "V", "R", "V"  …  "!", "!", "!", "!", "!", "!", "!", "!", "!", "!"])

For this record alone, the dataset yields 2500+ beats, each stored as one 721-sample column.

In [9]:
size(dataset1), size(truth_labels1)
Out[9]:
((721, 2588), (2588,))

Each subject's data contains only a subset of the labels, depending on what heart condition they have.

In [10]:
unique(truth_labels1)
Out[10]:
4-element Array{String,1}:
 "R"
 "V"
 "L"
 "!"

Combining Data from Multiple Subjects

Again, I'm skipping some boring stuff, but I combine the datasets from several subjects and convert the truth labels into integers.
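
As a rough sketch of that label conversion (illustrative only; the actual logic lives in get_comb_dataset.jl), one way to do it is to build a key of the unique label strings and map each beat's label to its index in that key:

#Illustrative sketch only -- not the actual get_comb_dataset.jl code
label_key = unique(truth_labels1)                                  # e.g. ["R", "V", "L", "!"]
truth_cats = [findfirst(label_key, lab) for lab in truth_labels1]
#label_key[truth_cats] then recovers the original string labels, as used with countmap below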

In [11]:
include("get_comb_dataset.jl")
Out[11]:
get_comb_dataset (generic function with 1 method)
In [12]:
abn_dataset, full_dataset, abn_truth_cats, full_bin_cats, label_key = get_comb_dataset([207, 212], 360)
size(abn_dataset), size(full_dataset), label_key
Out[12]:
((721, 3668), (721, 3849), String["R", "V", "L", "!"])

This is the total dataset I ended up using. It is from 37 subjects, and has 19000+ normal beats and 24000+ abnormal beats. There are two classification tasks I am interested in:

1) Normal vs abnormal (binary classification)

2) Within abnormal, different classes of abnormal beats (multi-class classification)

In [13]:
abn_dataset, full_dataset, abn_truth_cats, full_bin_cats, label_key = get_comb_dataset([207, 212, 203, 209, 201, 202, 205, 208, 210, 213, 220, 221, 222, 230, 111, 112, 113, 114, 115, 116, 117, 118, 119, 121, 122, 123, 124, 100, 101, 103, 104, 106, 108, 109, 232, 233, 234],360)
size(abn_dataset), size(full_dataset), size(abn_truth_cats), size(full_bin_cats), label_key
Out[13]:
((721, 24392), (721, 43656), (24392,), (43656,), String["R", "V", "L", "!", "A", "F", "/", "f"])

This shows the breakdown of the abnormal beat types only. It's not perfectly even, but the largest class accounts for less than a fifth of the total, so it is not badly skewed.

In [14]:
countmap(label_key[abn_truth_cats])
Out[14]:
Dict{String,Int64} with 8 entries:
  "f" => 1404
  "A" => 4332
  "!" => 1380
  "/" => 4332
  "V" => 3096
  "L" => 3730
  "R" => 4348
  "F" => 1770

Here is the same breakdown with the labels converted to integers (integer i corresponds to label_key[i])...

In [15]:
countmap(abn_truth_cats)
Out[15]:
Dict{Int64,Int64} with 8 entries:
  7 => 4332
  4 => 1380
  2 => 3096
  3 => 3730
  5 => 4332
  8 => 1404
  6 => 1770
  1 => 4348

Breakdown of normal vs abnormal beats: this is roughly a 44/56 split, so not perfectly even, but not bad.

In [16]:
countmap(full_bin_cats)
Out[16]:
Dict{Int64,Int64} with 2 entries:
  2 => 24392
  1 => 19264

This function splits the dataset into training and testing sets, based on the given proportion of the data to hold out as the test set.
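
As a rough sketch of what such a split might look like (assumed interface, not the actual get_traintest.jl code):

#Illustrative sketch only -- the real logic lives in get_traintest.jl
#Shuffle the beat columns and hold out the requested fraction as the test set
function split_traintest(x, y, test_frac)
    n = size(x, 2)
    perm = randperm(n)                    # random permutation of the beat indices
    ntst = round(Int, test_frac*n)        # number of test examples
    tst, trn = perm[1:ntst], perm[ntst+1:end]
    return x[:, tst], y[tst], x[:, trn], y[trn]
end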

In [17]:
include("get_traintest.jl")
Out[17]:
get_traintest (generic function with 1 method)

Separating into Training and Testing Sets

In [18]:
xtst_mc, ytst_mc, xtrn_mc, ytrn_mc, xtst_bin, ytst_bin, xtrn_bin, ytrn_bin = get_traintest(abn_dataset,full_dataset,abn_truth_cats,full_bin_cats,0.1);

For multi-class classification, here is the dataset: 2400+ beats in the test set and almost 22000 in the training set.

In [19]:
size(xtst_mc), size(ytst_mc), size(xtrn_mc), size(ytrn_mc)
Out[19]:
((721, 2439), (2439,), (721, 21953), (21953,))

For binary classification, here is the dataset: 4300+ beats in the test set and 39000+ in the training set.

In [20]:
size(xtst_bin), size(ytst_bin), size(xtrn_bin), size(ytrn_bin)
Out[20]:
((721, 4366), (4366,), (721, 39290), (39290,))