De-duplicating clinic visits

Problem

We want to only keep an individual’s clinic visits which are for new infections.
Therefore, want to remove repeat visit to the clinic
where a repeat is defined as within 6 weeks of the previous visit.

Solutions

There are a number of way to tackle this problem which vary in terms of computation speed, simplicity and intuition.

To begin, load the packages we’re going to need:

library(reshape2)
library(dplyr)
library(readxl)
library(lubridate)
library(knitr)

Load in the data. This is in the form of individual patient data (IPD) clinic visit dates and ids.

load(file = here::here("cleaned_visit_date.RData"))
dat0 <- dat
kable(head(dat))

 uniq_iddateperson_id
67219F053090_2012-03-072012-03-07F053090
135647F105975_2014-05-212014-05-21F105975
21106716M03749_2016-04-182016-04-1816M03749
49805M091207_2012-03-272012-03-27M091207
39751F066543_2011-12-292011-12-29F066543
37228F108730_2011-12-022011-12-02F108730

Simple loops solution

We could simply step through the original data line by line.

library(dplyr)

keep_record <- c()
dat <- dat0
dat <-
  dat[1:5000, ] %>% 
  arrange(person_id, date)

previous_patient <- ""
previous_date <- 0

for (i in seq_len(nrow(dat))){

  person_id <- dat[i, "person_id"]
  date <- dat[i, "date"]

  # first visit
  if (person_id != previous_patient){
    # print("new id")
    keep <- TRUE
    previous_patient <- person_id
    previous_date <- date

  } else if ((date - previous_date) > 42){
    # print("over 42")
    keep <- TRUE
    previous_date <- date
  } else{

    keep <- FALSE
  }

  keep_record <- c(keep_record, keep)
}

table(keep_record)

## keep_record
## FALSE  TRUE 
##    31  4969

Tidyverse solution

So within a while loop we group by individuals and for the earliest date calculate a 6 week time window (end date window_end) and a flag whether or not subsequent clinic visits are within this (within_window).

We keep track of 2 output patient lists: repeat_visit_ids and first_visit_ids.
The later is just the id from the first date.
Once we have added to these list given the information from window_end then we remove the first date and the repeat visit associated with this visit.
Finally, we repeat the whole thing until we reach the end of the dataframe.

time_window <- weeks(6) #  duration(42, "days")          #6 weeks in days
repeat_visit_ids <- NULL   #initialise
first_visit_ids  <- NULL
dat <- dat0

while(nrow(dat) > 0) {

  dat <-
    dat %>%
    group_by(person_id) %>%
    dplyr::arrange(date, .by_group = TRUE) %>% 
    mutate(window_end = min(date) + time_window,
           within_window = date <= window_end,
           first_date = date == min(date)) %>% 
    ungroup()

  repeat_visit_ids <-
    dat %>%
    filter(within_window & !first_date) %>%
    transmute(uniq_id) %>%
    rbind(repeat_visit_ids, .)

  first_visit_ids <-
    dat %>%
    filter(first_date) %>%
    transmute(uniq_id) %>%
    rbind(first_visit_ids, .)

  dat <-
    dat %>%
    filter(!within_window)
}

The output is a list of unique visit ids.

kable(head(first_visit_ids))

uniq_id
14F00004_2014-05-07
14F00006_2014-05-07
14F00016_2014-05-07
14F00019_2014-05-07
14F00021_2014-05-07
14F00027_2014-06-12
nrow(first_visit_ids)

## [1] 90389

Pre-processed data structure solution

We could create a data structure for easier computation.

First, lets index the repeat visits for each patient by creating a ‘visit’ column with a counter.

dat <- dat0

dat <-
  dat %>%
  group_by(person_id) %>%
  arrange(date) %>%
  mutate(visit = row_number()) %>%
  arrange(person_id, visit)

kable(head(dat, n = 10))

uniq_iddateperson_idvisit
14F00004_2014-05-072014-05-0714F000041
14F00004_2014-05-272014-05-2714F000042
14F00004_2014-06-062014-06-0614F000043
14F00006_2014-05-072014-05-0714F000061
14F00016_2014-05-072014-05-0714F000161
14F00019_2014-05-072014-05-0714F000191
14F00021_2014-05-072014-05-0714F000211
14F00021_2014-06-112014-06-1114F000212
14F00027_2014-06-122014-06-1214F000271
14F00034_2014-06-062014-06-0614F000341

For any computation using this data set we can simply remove the single visit individuals because we already know about these.
This reduced the size of the dataframe by some way.

We can cast the data to a wide form to show a kind of patient trajectory such that the columns of visit counts and entries are dates (on a numeric scale).

yy <- dcast(person_id ~ visit,
            data = dat, value.var = "date")

yy <-
  yy %>%
  filter(!is.na(`2`))

kable(head(yy[ ,1:11], n = 10))

person_id12345678910
14F00004161971621716227NANANANANANANA
14F000211619716232NANANANANANANANA
14F00048161981620416484166181700117224174881774417831NA
14F00054161981620917080NANANANANANANA
14F0007116226162731638016394164081652716555NANANA
14F00073161981635917416NANANANANANANA
14F000751626117031NANANANANANANANA
14F001061635916378NANANANANANANANA
14F001081620216393NANANANANANANANA
14F00109162021623016237NANANANANANANA

Rather than this flat representation, we can split the data in to a list of visit frequencies.
That is each element in the list consist of trajectories for those individual with that number of visits.
The elements in the dataframes are the difference between the ith visit and all of the subsequent.
We’ll use a subsample just to keep the computation time down.
Note that I remove the patient id so that I can do compuation on all of the visit times before including it back in.

yy <- yy[1:1000, ]
visits_wide <- select(yy ,-person_id)
zz <- list()

for (i in seq_along(visits_wide)) {

  zz[[i]] <- visits_wide - visits_wide[, i]   # difference between origin visit and subsequent dates
  zz[[i]][zz[[i]] <= 0] <- NA                 # replace negative times

  # are all visits/ whole row NAs?
  all_NA  <- rowSums(is.na(zz[[i]])) != ncol(zz[[i]])
  zz[[i]]$person_id <- yy$person_id           # include patient id
  zz[[i]] <- zz[[i]][all_NA, ] # remove rows
}

lapply(zz[1:3], function(x) x[1:6, 1:11])

## [[1]]
##    1   2    3   4   5    6    7    8    9 10 11
## 1 NA  20   30  NA  NA   NA   NA   NA   NA NA NA
## 2 NA  35   NA  NA  NA   NA   NA   NA   NA NA NA
## 3 NA   6  286 420 803 1026 1290 1546 1633 NA NA
## 4 NA  11  882  NA  NA   NA   NA   NA   NA NA NA
## 5 NA  47  154 168 182  301  329   NA   NA NA NA
## 6 NA 161 1218  NA  NA   NA   NA   NA   NA NA NA
## 
## [[2]]
##     1  2    3   4   5    6    7    8    9 10 11
## 1  NA NA   10  NA  NA   NA   NA   NA   NA NA NA
## 3  NA NA  280 414 797 1020 1284 1540 1627 NA NA
## 4  NA NA  871  NA  NA   NA   NA   NA   NA NA NA
## 5  NA NA  107 121 135  254  282   NA   NA NA NA
## 6  NA NA 1057  NA  NA   NA   NA   NA   NA NA NA
## 10 NA NA    7  NA  NA   NA   NA   NA   NA NA NA
## 
## [[3]]
##     1  2  3   4   5   6    7    8    9 10 11
## 3  NA NA NA 134 517 740 1004 1260 1347 NA NA
## 5  NA NA NA  14  28 147  175   NA   NA NA NA
## 12 NA NA NA 151 785  NA   NA   NA   NA NA NA
## 15 NA NA NA 185  NA  NA   NA   NA   NA NA NA
## 18 NA NA NA  69 182  NA   NA   NA   NA NA NA
## 20 NA NA NA 161  NA  NA   NA   NA   NA NA NA

The reason for splitting the data in this way is that now doing operations on it to deduplicate are straightforward because its kind of already sorted.
The below was my first attempt at this approach.
It slow and pretty ugly.

out <- zz
from_visit_seq <- head(seq_along(zz), -1)

for (i in from_visit_seq){

  # only consider later columns (to the right)
  keep_cols <- names(out[[i]])[!names(out[[i]]) %in% (1:i)]
  future_visits <- out[[i]][ ,keep_cols]

  for (j in out[[i]]$person_id){

    if (nrow(out[[i]]) == 0) break

    future_id <- future_visits[future_visits$person_id == j, ]

    # which visit number (column name) is within time window (42 days) for each patient?
    times <- select(future_id, -person_id)
    visit_rm <- colnames(times)[times <= time_window & !is.na(times)]

    if (length(visit_rm) > 0) {

      # remove these repeat visits in list element
      for (k in as.numeric(visit_rm))
        out[[k]] <- out[[k]][out[[k]]$person_id != j, ]
    }
  }
}

lapply(out[1:3], function(x) x[1:6, 1:11])

## [[1]]
##    1   2    3   4   5    6    7    8    9 10 11
## 1 NA  20   30  NA  NA   NA   NA   NA   NA NA NA
## 2 NA  35   NA  NA  NA   NA   NA   NA   NA NA NA
## 3 NA   6  286 420 803 1026 1290 1546 1633 NA NA
## 4 NA  11  882  NA  NA   NA   NA   NA   NA NA NA
## 5 NA  47  154 168 182  301  329   NA   NA NA NA
## 6 NA 161 1218  NA  NA   NA   NA   NA   NA NA NA
## 
## [[2]]
##      1  2    3    4    5  6  7  8  9 10 11
## 273 NA NA 2069   NA   NA NA NA NA NA NA NA
## 398 NA NA  831 1317 1321 NA NA NA NA NA NA
## 494 NA NA  995 1038   NA NA NA NA NA NA NA
## 504 NA NA  258   NA   NA NA NA NA NA NA NA
## 678 NA NA  578  672 1519 NA NA NA NA NA NA
## NA  NA NA   NA   NA   NA NA NA NA NA NA NA
## 
## [[3]]
##       1  2  3  4  5  6  7  8  9 10 11
## NA   NA NA NA NA NA NA NA NA NA NA NA
## NA.1 NA NA NA NA NA NA NA NA NA NA NA
## NA.2 NA NA NA NA NA NA NA NA NA NA NA
## NA.3 NA NA NA NA NA NA NA NA NA NA NA
## NA.4 NA NA NA NA NA NA NA NA NA NA NA
## NA.5 NA NA NA NA NA NA NA NA NA NA NA

I realised that the break could be replaced with a next if I moved the if statement up a level outside of the j for loop.
Also, I don’t have to do the duplicate visit identification and removal separately.
I only need to consider the visits in the future of the current visit, hence the (i+1):ncol(tmp) term.
This produces a much cleaner chunk of code.

from_visit_seq <- head(seq_along(zz), -1)

for (i in from_visit_seq){

  if (nrow(out[[i]]) == 0) next

  for (j in out[[i]]$person_id){

    # stop at first time outside window
    t <- i + 1
    out_person <- out[[i]][out[[i]]$person_id == j, ]

    # remove these repeat visits in list element
    while(out_person[, t] <= time_window & !is.na(out_person[, t])){
      out[[t]] <- out[[t]][out[[t]]$person_id != j, ]
      t <- t + 1
    }
  }
}

lapply(out[1:3], function(x) x[1:6, 1:11])

## [[1]]
##    1   2    3   4   5    6    7    8    9 10 11
## 1 NA  20   30  NA  NA   NA   NA   NA   NA NA NA
## 2 NA  35   NA  NA  NA   NA   NA   NA   NA NA NA
## 3 NA   6  286 420 803 1026 1290 1546 1633 NA NA
## 4 NA  11  882  NA  NA   NA   NA   NA   NA NA NA
## 5 NA  47  154 168 182  301  329   NA   NA NA NA
## 6 NA 161 1218  NA  NA   NA   NA   NA   NA NA NA
## 
## [[2]]
##      1  2    3    4    5  6  7  8  9 10 11
## 273 NA NA 2069   NA   NA NA NA NA NA NA NA
## 398 NA NA  831 1317 1321 NA NA NA NA NA NA
## 494 NA NA  995 1038   NA NA NA NA NA NA NA
## 504 NA NA  258   NA   NA NA NA NA NA NA NA
## 678 NA NA  578  672 1519 NA NA NA NA NA NA
## NA  NA NA   NA   NA   NA NA NA NA NA NA NA
## 
## [[3]]
##       1  2  3  4  5  6  7  8  9 10 11
## NA   NA NA NA NA NA NA NA NA NA NA NA
## NA.1 NA NA NA NA NA NA NA NA NA NA NA
## NA.2 NA NA NA NA NA NA NA NA NA NA NA
## NA.3 NA NA NA NA NA NA NA NA NA NA NA
## NA.4 NA NA NA NA NA NA NA NA NA NA NA
## NA.5 NA NA NA NA NA NA NA NA NA NA NA

Finally, include a visit count column and remove empty dataframes.

names(out) <- as.character(seq_along(out) + 1)

for (i in names(out)){
  if (nrow(out[[i]]) == 0) out[[i]] <- NULL
  else out[[i]]$visit <- as.numeric(i)
}

Stack the lists and include back in the initial visits to give all visits with duplicates removed.

res <-
  purrr::map_dfr(out, `[`, c("person_id", "visit")) %>% 
  merge(dat) %>% 
  rbind.data.frame(dat[dat$visit == 1, ]) %>% 
  arrange(person_id, visit)

kable(head(res))

person_idvisituniq_iddate
14F00004114F00004_2014-05-072014-05-07
14F00004214F00004_2014-05-272014-05-27
14F00006114F00006_2014-05-072014-05-07
14F00016114F00016_2014-05-072014-05-07
14F00019114F00019_2014-05-072014-05-07
14F00021114F00021_2014-05-072014-05-07
nrow(res)

## [1] 64799

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: