using CSV, Plots, Images #, TimeSeries
using StatsBase
using Interact
using Knet
plotlyjs()
include("findpeaks.jl")
include("pan_tompkins.jl")
include("run_pan_tompkins.jl")
include("get_breaks.jl")
So there are two main predicaments with taking the beats we have separated and converting them into a full dataset that one can do deep learning on:
1) The beats are different lengths, and that actually matters. The length of a beat can be indicative of whether the person has an arrhythmia, and if so, what type.
2) The annotations were done by hand and they include non-beat annotations, so they have to be correctly matched up to the identified peaks to infer the truth labels.
# Test on an arrhythmia sample (record 207, sampled at 360 Hz)
fs_arrhyth = 360;
sample_arrhyth = CSV.read("207.csv", header = ["Time", "ECG1", "ECG2"], datarow = 1, nullable = false)
fs = fs_arrhyth;
x = sample_arrhyth[:,3];   # second ECG channel
t = sample_arrhyth[:,1];   # time stamps
x_HP, x_MAF, qrs_i, qrs_c, SIGL_buf, NOISL_buf, THRS_buf, qrs_i_raw, qrs_amp_raw, SIGL_buf1, NOISL_buf1, THRS_buf1 = run_pan_tompkins(x, t, fs);
breaks = get_breaks(t, qrs_i_raw);   # break points between the detected beats
Here's what the annotations look like. There are two files: one with the timestamps and the other with the actual annotation codes. Each character represents a different type of beat or non-beat phenomenon. Our goal is to match each annotation timestamp to the closest identified peak.
# Timestamps of the annotations
sample_ann = CSV.read("207_ann.csv", header = false, datarow = 1, nullable = false)
# Annotation codes for the beats
sample_labels = CSV.read("207_anntype.csv", header = false, datarow = 1, nullable = false)
# Assign each detected peak the label of the nearest hand annotation
truth_labels = Vector{String}(length(qrs_i_raw))
for i = 1:length(qrs_i_raw)
    truth_labels[i] = sample_labels[indmin(abs.(qrs_i_raw[i] .- sample_ann[:,1])), 1]
end
I'm skipping a lot of the boring details of converting this into a dataset, but basically it involves:
1) Creating a window of 1 second on either side of the peak and fitting the identified beat into it. If the beat extends less than 1 second on either side of the peak, it is zero-padded so the peak stays centered in the window (see the sketch after this list).
2) If either or both sides of the beat extend more than 1 second from the peak, they are trimmed to 1 second.
This strategy forces all the beat vectors to be the same length while still capturing the actual underlying length of each beat. I also do some curation of the data categories and some data augmentation to try to even out the class sizes.
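The windowing step is roughly the following. This is only a sketch of that idea, not the actual get_dataset.jl implementation; window_beat, beat_start, and beat_stop are hypothetical names.
# Sketch: fit one beat into a fixed-length, zero-padded window centered on its peak.
# Assumes beat_start <= peak <= beat_stop are sample indices into x, and that the
# window stays inside the bounds of x.
function window_beat(x, peak, beat_start, beat_stop, fs)
    win    = zeros(2*fs + 1)             # one second on each side of the peak
    center = fs + 1                      # the peak sample sits at the center
    left   = min(peak - beat_start, fs)  # samples kept to the left (trimmed at 1 s)
    right  = min(beat_stop - peak, fs)   # samples kept to the right (trimmed at 1 s)
    win[center-left : center+right] = x[peak-left : peak+right]
    return win
end
With fs = 360 this sketch gives 721-sample vectors (one second on each side plus the peak sample), so every beat ends up the same length, while the extent of the zero padding still reflects how long the beat actually was.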
include("get_dataset.jl")
dataset1,truth_labels1 = get_dataset(207,360)
For example, this single subject yields 2500+ beats.
size(dataset1), size(truth_labels1)
Each subject's data contains only a subset of the labels, depending on which heart condition they have.
unique(truth_labels1)
Again, I'm skipping some boring stuff, but I combine the datasets from several subjects and convert the truth labels into integers.
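The label-to-integer conversion is essentially an index lookup into a key of unique labels. Here is a rough sketch of just that step (illustrative only; the real get_comb_dataset.jl also concatenates the per-subject datasets and does the class curation):
# Sketch: map string truth labels to integer categories via a label key
label_key  = unique(truth_labels)               # one entry per distinct annotation code
truth_cats = indexin(truth_labels, label_key)   # each beat's label becomes its index into label_key
Keeping label_key around lets you map the integers back to annotation codes, which is how the countmap(label_key[abn_truth_cats]) call further down works.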
include("get_comb_dataset.jl")
abn_dataset, full_dataset, abn_truth_cats, full_bin_cats, label_key = get_comb_dataset([207, 212], 360)
size(abn_dataset), size(full_dataset), label_key
This is the total dataset I ended up using. It is from 37 subjects, and has 19000+ normal beats and 24000+ abnormal beats. There are two classification tasks I am interested in:
1) Normal vs abnormal (binary classification)
2) Within abnormal, different classes of abnormal beats (multi-class classification)
abn_dataset, full_dataset, abn_truth_cats, full_bin_cats, label_key = get_comb_dataset([207, 212, 203, 209, 201, 202, 205, 208, 210, 213, 220, 221, 222, 230, 111, 112, 113, 114, 115, 116, 117, 118, 119, 121, 122, 123, 124, 100, 101, 103, 104, 106, 108, 109, 232, 233, 234],360)
size(abn_dataset), size(full_dataset), size(abn_truth_cats), size(full_bin_cats), label_key
This shows the breakdown of the abnormal beat types only. It's not perfectly even, but the largest class is roughly one-sixth of the total, so the skew isn't severe.
countmap(label_key[abn_truth_cats])
Converted to integers...
countmap(abn_truth_cats)
Breakdown of normal vs abnormal beats... this is about a 45/55 split, so not perfectly even, but not bad.
countmap(full_bin_cats)
This function just splits the datasets into training and test sets, given the proportion of data to hold out for testing.
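A minimal sketch of such a split for a single dataset (hypothetical names, assuming beats are stored as rows; not the actual contents of get_traintest.jl):
# Sketch: randomly hold out a fraction of the beats as the test set.
# (On Julia >= 0.7, randperm requires `using Random`.)
function split_traintest(data, labels, test_frac)
    n    = size(data, 1)                 # number of beats (rows)
    perm = randperm(n)                   # shuffled beat indices
    ntst = round(Int, test_frac * n)     # size of the test set
    tst, trn = perm[1:ntst], perm[ntst+1:end]
    return data[tst, :], labels[tst], data[trn, :], labels[trn]
end
Returning the test split first mirrors the ordering used in the call below.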
include("get_traintest.jl")
xtst_mc, ytst_mc, xtrn_mc, ytrn_mc, xtst_bin, ytst_bin, xtrn_bin, ytrn_bin = get_traintest(abn_dataset,full_dataset,abn_truth_cats,full_bin_cats,0.1);
For multi-class classification, here is the dataset: 2400+ in test set and almost 22000 in training set.
size(xtst_mc), size(ytst_mc), size(xtrn_mc), size(ytrn_mc)
For binary classification, here is the dataset: 4300+ in test set and 39000+ in training set.
size(xtst_bin), size(ytst_bin), size(xtrn_bin), size(ytrn_bin)