Renaming files (from biospecimen IDs and aliases)

Motivation

After changing the way that we label samples, we sometimes need to update a previous file-name or table column name to reflect the new system. This code will not produce the same outputs anymore, since the state of drives and the database has changed in the meantime. It's meant more as documentation of the process.

Getting current data

The first thing to do in most projects is to load the airtable database into memory. If you want to guarantee that you have the most recent version of any particular table, use the update argument of LocalBase.

julia> using VKCComputing, Dates

julia> base = LocalBase(; update=["Biospecimens"=> Week(1), "Projects"=> true]);
[ Info: Table Aliases does not need updating updating. Use `update = true` or `update = {shorter interval}` to override
[ Info: Loading records from local JSON file
[ Info: Table Biospecimens does not need updating updating. Use `update = true` or `update = {shorter interval}` to override
[ Info: Loading records from local JSON file
#...

Let's say I want to find any samples from the ECHO project that need to be updated. I don't remember how the ECHO project is referred to in the database, so I need to check what the uids from the "Projects" table are:

julia> uids(base, "Projects") # or `uids(base["Projects"])`
2-element Dictionaries.Indices{String}
 "resonance"
 "khula"

Looks like it's "resonance"! Next, I'll get all of the samples associated with that project. There are a couple of ways to do that; I'll use a somewhat roundabout way to show off a couple of features:

julia> proj = base["Projects", "resonance"]; # get the project record

julia> keys(proj); # look at what fields are available... I want :Subjects
(:uid, :name, :Visits, :Subjects)

julia> first(proj[:Subjects], 5) # verify that these are record hashes
5-element Vector{String}:
 "recF3MoP0RZ4EqRrh"
 "recnZclKoXvxPbARR"
 "recW3HYwsZPYt8rtD"
 "recmljdPsUa55j2BS"
 "recaRvIXKLtDNQBOb"

Once I have a list of records hashes, I can use them to pull records directly:

julia> subjects = base[proj[:Subjects]]
770-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("recF3MoP0RZ4EqRrh", AirTable("Subjects"), (uid = "1255", Biospecimens = ["recnfhuOaRwSfOLOq", "recrSIHTxVxj3JXav", "rec7j3VFnZb0Uue5k", "recrfP0SIqxTjk0Vk", "recLsUbcOI32ZjrSN", "rec1Ai2Nz0yCmpLpa", "receGDNvbRuQpiCNQ", "recPVakgZpe01B9ZJ", "rec5iE5o92ManNgDG"], project = ["recOlnbWbRoJdy4Nq"]))
 Airtable.AirRecord("recnZclKoXvxPbARR", AirTable("Subjects"), (uid = "0672", Biospecimens = ["recnKOG7QwOKCG8Tk", "recisA4suwsc4iq3U", "rec98YN0dSis3lxbQ", "recV29l0ADxUL2xk5", "recLRCssF2ZBC8suZ", "rec46VGKOVth2SQ6U", "recDUj6DDLyjNSbki", "reclzmMkTJiWFCJWI", "reczBSh8Vav6D0qQy", "reclUM1eREFipQ6ND"], project = ["recOlnbWbRoJdy4Nq"]))
 #...

Now, I can get all of the biospecimens associated with these subjects:

julia> mapreduce(rec-> get(rec, :Biospecimens, []), vcat, subjects)
3317-element Vector{String}:
 "recnfhuOaRwSfOLOq"
 "recrSIHTxVxj3JXav"
 "rec7j3VFnZb0Uue5k"
 "recrfP0SIqxTjk0Vk"
#...

Then, I can use these record hashes to pull the biospecimen records:

julia> biosp = base[mapreduce(rec-> get(rec, :Biospecimens, []), vcat, subjects)]
3317-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("recnfhuOaRwSfOLOq", AirTable("Biospecimens"), (uid = "FE50074", subject = ["recF3MoP0RZ4EqRrh"], collection_buffer = ["recxsTHmTS84TBMPF"], aliases = ["recP0jpfk49JXtVp2"], project = ["recOlnbWbRoJdy4Nq"]))
 Airtable.AirRecord("recrSIHTxVxj3JXav", AirTable("Biospecimens"), (uid = "FE01868", subject = ["recF3MoP0RZ4EqRrh"], collection = 3, collection_buffer = ["recxsTHmTS84TBMPF"], visit = ["recjx3Tb7wf6WXO6t"], aliases = ["recsqOcdY54CU5hQT"], project = ["recOlnbWbRoJdy4Nq"]))
#...

A more straightforwards approach would have been to look at all of the biospecimen records, and filter on the ones where Project had the id hash for the "resonance" project.

Building the renaming map

Now that we have all of the biospecimens for ECHO, let's just get the ones that have been shotgun sequenced, and identify the SequencingPrep records associated with them.

julia> mapreduce(keys, union, biosp)
8-element Vector{Symbol}:
 :uid
 :subject
 :collection_buffer
 :aliases
 :project
 :collection
 :visit
 :seqprep

julia> filter!(rec-> haskey(rec, :seqprep), biosp);

Now, we want to build a mapping of biospecimen => seqprep ID. I'll store this as rows in a DataFrame.

julia> rndf = DataFrame(mapreduce(vcat, biosp) do rec
    rows = [(; seqname = base[id][:uid], oldname = rec[:uid]) for id in rec[:seqprep]]
end);

It is in principle possible for a single biospecimen ID to refer to multiple seqprep IDs. Let's check that:

julia> transform!(groupby(rndf, :oldname), "seqname"=> length => "n_seqs");

julia> subset(rndf, "n_seqs"=> ByRow(>(1)))
6×3 DataFrame
 Row │ seqname   oldname  n_seqs
     │ String    String   Int64
─────┼───────────────────────────
   1 │ SEQ02303  FG02294       2
   2 │ SEQ02303  FG02294       2
   3 │ SEQ01232  FE01105       2
   4 │ SEQ01110  FG00016       2
   5 │ SEQ01110  FG00016       2
   6 │ SEQ01232  FE01105       2

Now, we'll build a map of oldname -> seqname.

rnmap = dictionary(zip(rndf.oldname, rndf.seqname))
1799-element Dictionary{String, String}
 "FG50159" │ "SEQ01107"
 "FG50160" │ "SEQ01108"
 "FG00846" │ "SEQ01371"
 #...

Finding files to rename

I'm running this on hopper, so files are contained in the /grace drive, as well as some other places. But we'll start there.

The first thing to do is find all of the files that could plausibly be in ECHO. They all fit the pattern r"F[EG]\d{5} - that is, "FE or "FG" followed by 5 numbers (\d stands for "digit"). Let's make sure that's true for our rename map:

julia> all(k-> contains(k, r"F[EG]\d{5}"), keys(rnmap))
true

So now we'll recurse through the directory, saving any files that fit the pattern. While we're at it, we can pull out some relevant info and push it into a DataFrame.

julia> filedf = DataFrame()

julia>  for (root, dir, files) in walkdir("/grace/sequencing/processed/mgx")
            for file in files
                m = match(r"^(F[EG]\d{5})_(S\d+)_", file)
                isnothing(m) && continue
                newname = get(rnmap, m[1], nothing)
                if isnothing(newname)
                    @warn "$(m[1]) matches the regex, but doesn't have a new name"
                    continue
                end
                push!(filedf, (;
                    oldname = m[1], newname, snum = m[2], oldpath = joinpath(root, file), newpath = joinpath(root, replace(file, m[1]=>newname))
                ))
            end
        end

julia> first(filedf, 5)
5×5 DataFrame
 Row │ oldname    newname   snum       oldpath                            newpath
     │ SubStrin…  String    SubStrin…  String                             String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ FE01063    SEQ02371  S93        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…
   2 │ FE01063    SEQ02371  S93        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…
   3 │ FE01063    SEQ02371  S93        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…
   4 │ FE01064    SEQ00905  S10        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…
   5 │ FE01064    SEQ00905  S10        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…

Now, we want to double check that all of the oldnames are associated with the same snum to avoid ambiguities. Here, we group by oldname, and then look at the snum column to ensure there is only 1 (we check that the length of unique elements in that column is 1).

julia> transform!(groupby(filedf, :oldname), "snum"=> (sn-> length(unique(sn)) != 1) => "ambiguous");

julia> unique(subset(filedf, "ambiguous"=> identity), ["oldname", "snum"])
16×6 DataFrame
 Row │ oldname    newname   snum       oldpath                            newpath                            ambiguous
     │ SubStrin…  String    SubStrin…  String                             String                             Bool
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ FG00004    SEQ01071  S26        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   2 │ FG00004    SEQ01071  S46        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   3 │ FG00005    SEQ01950  S50        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   4 │ FG00005    SEQ01950  S58        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   5 │ FG00016    SEQ01110  S52        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   6 │ FG00016    SEQ01110  S70        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   7 │ FG00017    SEQ02084  S64        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   8 │ FG00017    SEQ02084  S82        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
   9 │ FG00021    SEQ01862  S53        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  10 │ FG00021    SEQ01862  S94        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  11 │ FG02294    SEQ02303  S1         /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  12 │ FG02294    SEQ02303  S34        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  13 │ FG02471    SEQ01727  S16        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  14 │ FG02471    SEQ01727  S89        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  15 │ FG02614    SEQ01971  S10        /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true
  16 │ FG02614    SEQ01971  S9         /grace/sequencing/processed/mgx/…  /grace/sequencing/processed/mgx/…       true

So for now, we'll leave these out and deal with them later.

Renaming

julia>  for grp in groupby(filedf, "oldname")
            old = first(grp.oldname)
            @info "Working on $old"
            if any(grp.ambiguous)
                @warn "$old has multiple sequecing results: $(unique(grp.snum)); skipping!"
            else
                @info "Renaming $(old) to $(first(grp.newname))"
                for row in eachrow(grp)
                    @debug "$(row.oldpath) => $(row.newpath)"
                    mv(row.oldpath, row.newpath)
                end
            end
        end

Dealing with Ambiguities

Let's go back to our ambiguous sequences.

julia> ambi = subset(filedf, "ambiguous"=> identity);

julia> unique(select(ambi, ["oldname", "newname", "snum"]))
16×3 DataFrame
 Row │ oldname    newname   snum
     │ SubStrin…  String    SubStrin…
─────┼────────────────────────────────
   1 │ FG00004    SEQ01071  S26
   2 │ FG00004    SEQ01071  S46
   3 │ FG00005    SEQ01950  S50
   4 │ FG00005    SEQ01950  S58
   5 │ FG00016    SEQ01110  S52
   6 │ FG00016    SEQ01110  S70
   7 │ FG00017    SEQ02084  S64
   8 │ FG00017    SEQ02084  S82
   9 │ FG00021    SEQ01862  S53
  10 │ FG00021    SEQ01862  S94
  11 │ FG02294    SEQ02303  S1
  12 │ FG02294    SEQ02303  S34
  13 │ FG02471    SEQ01727  S16
  14 │ FG02471    SEQ01727  S89
  15 │ FG02614    SEQ01971  S10
  16 │ FG02614    SEQ01971  S9

For the first 12 rows (samples FG00004, F00005, FG00016, FG00017, FG0021, and FG02294), the solution is relatively simple. Each of these had 2 aliquots of the same biospecimen sequenced. So in each of those cases, I duplicated the record in the SequencingPrep table (generating a new SEQ id), added the correct S-well number, and put in the old IDs as aliases. Eg, FG00004 is now

uidautonumberbiospecimenswellalias
SEQ010711071FG00004S26C01174F1A
SEQ025052505FG00004S46C01174F1B

For FG02614, there is only a single file (FG02614_S9_pfams.tsv) that has the S9 identifier. Looking at the contents of that file:

❯ head mgx/humann/regroup/FG02614_S9_pfams.tsv
# Gene Family   FG02610_S9_Abundance-RPKs
UNMAPPED        14128546.0
UNGROUPED       29634105.328100212
UNGROUPED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens       17608.29535663359
UNGROUPED|g__Agathobaculum.s__Agathobaculum_butyriciproducens   44303.814025389926
UNGROUPED|g__Akkermansia.s__Akkermansia_muciniphila     118128.883395747
UNGROUPED|g__Alistipes.s__Alistipes_finegoldii  125084.06209765507
UNGROUPED|g__Alistipes.s__Alistipes_indistinctus        24720.017810163998
UNGROUPED|g__Alistipes.s__Alistipes_putredinis  1052709.4440796026
UNGROUPED|g__Alistipes.s__Alistipes_shahii      86567.54527132263

Based on the header, it looks like it belongs to FG02610. We have all of the files for FG02610, so I'm just going to delete these files (the FG02614_S9 ones).

FG02471 is weirder. We have all of the relevant files for both S16 and S89. Looking at the contents,

kevin in vkclab-ada in /lovelace/sequencing/processed on ☁️  (us-east-1)
❯ head -5 mgx/metaphlan/FG02471_S*_profile.tsv
==> mgx/metaphlan/FG02471_S16_profile.tsv <==
#mpa_v30_CHOCOPhlAn_201901
#/home/vklepacc/miniconda3/envs/biobakery3/bin/metaphlan output/kneaddata/FG02471_S16_kneaddata.fastq output/metaphlan/FG02471_S16_profile.tsv --bowtie2out output/metaphlan/FG02471_S16_bowtie2.tsv --samout output/metaphlan/FG02471_S16.sam --input_type fastq --nproc 8 --bowtie2db /pool001/vklepacc/databases/metaphlan
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       99.08787

==> mgx/metaphlan/FG02471_S89_profile.tsv <==
#mpa_v30_CHOCOPhlAn_201901
#/home/vklepacc/miniconda3/envs/biobakery3/bin/metaphlan output/kneaddata/C0477-5F-1A_S89_merged.fastq output/metaphlan/C0477-5F-1A_S89_profile.tsv --bowtie2out output/metaphlan/C0477-5F-1A_S89_bowtie2.tsv --samout output/metaphlan/C0477-5F-1A_S89.sam --input_type fastq --nproc 8
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       99.12309

kevin in vkclab-ada in /lovelace/sequencing/processed on ☁️  (us-east-1)
❯ head -5 mgx/humann/main/FG02471_*genefamilies.tsv
==> mgx/humann/main/FG02471_S16_genefamilies.tsv <==
# Gene Family   FG02471_S16_Abundance-RPKs
UNMAPPED        4356120.0000000000
UniRef90_A0A3E2XRX6     15802.1208367696
UniRef90_A0A3E2XRX6|g__Dorea.s__Dorea_longicatena       15802.1208367696
UniRef90_A5ZYV3 12866.1165754399

==> mgx/humann/main/FG02471_S89_genefamilies.tsv <==
# Gene Family   C0477-5F-1A_S89_Abundance-RPKs
UNMAPPED        3152316.0000000000
UniRef90_A7V4G2 8649.8613270401
UniRef90_A7V4G2|g__Bacteroides.s__Bacteroides_uniformis 8552.3697112122
UniRef90_A7V4G2|unclassified    97.4916158279

The S89 has the old ID in the header (C0477-5F-1A), which is the correct one according to the old database, but it looks like this same ID was sequenced in a different batch (old batch 11) under the ID FG00757. Given the surrounding S-well IDs, it looks like S16, sequenced in (old) batch 17 is correctly attributed to FG02471, while S89 should go with FG00757.

Conclusion

This example shows how to download and access the local airtable base, use it to generate a rename mapping, and use that to rename files, taking into account ambiguities.