VKCComputing.jl

Load the Airtable sample database into memory. Requires that an Airtable key with read access and a directory for storing local files are set in preferences (see set_readonly_pat! and set_airtable_dir!).

Updating local base files

The update keyword argument can take a number of different forms.

1. A boolean value, which updates all tables if true and no tables if false.
2. An AbstractTime from Dates (eg Week(1)), which updates any table whose local copy was last updated longer ago than this value.
3. A vector of Pairs of the form "$table_name" => x, where x is either of the options from (1) or (2) above.

For example, to update the "Biospecimens" table if it's older than a week, and to update the "Projects" table no matter what, call

julia> using VKCComputing, Dates

julia> base = LocalBase(; update=["Biospecimens"=> Week(1), "Projects"=> true]);
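The three forms of update can be pictured with a small stand-alone sketch. The helpers policy_for and needs_update below are hypothetical and are not part of VKCComputing. Treating a table that is missing from the vector as "do not update" is an assumption, since the description above does not say what happens to unlisted tables.

```julia
using Dates

# Hypothetical helper: pick the policy that applies to one table.
# `update` is a Bool, a Dates period, or a vector of "name" => policy pairs.
function policy_for(update, table::AbstractString; default = false)
    update isa AbstractVector || return update
    for (name, policy) in update
        name == table && return policy
    end
    return default  # assumption: unlisted tables are left alone
end

# A Bool policy is used as-is.
needs_update(policy::Bool, last_updated::DateTime, current::DateTime) = policy

# A period policy triggers an update once the local copy is older than the period.
needs_update(policy::Period, last_updated::DateTime, current::DateTime) =
    last_updated + policy < current

update = ["Biospecimens" => Week(1), "Projects" => true]
ref_time = DateTime(2024, 1, 15)

stale   = needs_update(policy_for(update, "Biospecimens"), DateTime(2024, 1, 1), ref_time)
fresh   = needs_update(policy_for(update, "Biospecimens"), DateTime(2024, 1, 10), ref_time)
forced  = needs_update(policy_for(update, "Projects"), DateTime(2024, 1, 14), ref_time)
skipped = needs_update(policy_for(update, "Subjects"), DateTime(2020, 1, 1), ref_time)
```

Comparing last_updated + policy against the current time, rather than comparing durations, keeps the sketch working for calendar periods such as Month(1) whose length is not fixed.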

Indexing

Indexing into the local base can be done either with the name of a table (eg base["Biospecimens"]), which will return a VKCAirtable, or using a record ID hash (eg base["recUqEcu3pM8p2jzQ"]).

Warning

Note that record ID hashes are identified based on the regular expression r"^rec[A-Za-z0-9]{14}$" - that is, a string starting with "rec", followed by exactly 14 alphanumeric characters. In principle, one could name a table as something that matches this regular expression, causing it to be improperly identified as a record hash rather than a table name.
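The collision described in this warning can be checked directly with the quoted regular expression. The helper name looks_like_record_id and the table name "recordsOfSamples1" are made up for illustration and do not come from VKCComputing.

```julia
# The pattern quoted in the warning above.
const RECORD_ID_PATTERN = r"^rec[A-Za-z0-9]{14}$"

# Hypothetical helper: would this key be read as a record ID?
looks_like_record_id(key::AbstractString) = occursin(RECORD_ID_PATTERN, key)

id_hit    = looks_like_record_id("recUqEcu3pM8p2jzQ")   # a record ID
name_miss = looks_like_record_id("Biospecimens")        # an ordinary table name
collision = looks_like_record_id("recordsOfSamples1")   # "rec" followed by 14 alphanumerics
```

The last key is a plausible table name, yet it matches the pattern, which is exactly the situation the warning is about.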

VKCAirtables can also be indexed with the uid column string, so an individual record can be accessed using eg base["Projects"]["khula"], but a 2-argument indexing option is provided for convenience, eg base["Projects", "khula"].
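One way such a convenience method could be built is by chaining the two single-argument lookups. The following is only a sketch with toy types: ToyBase, ToyTable, and the record payload are hypothetical stand-ins, and the real VKCComputing types may be implemented differently.

```julia
# Toy types for illustration only; the real VKCComputing types are not reproduced here.
struct ToyTable
    records::Dict{String, String}      # uid => placeholder record payload
end

struct ToyBase
    tables::Dict{String, ToyTable}     # table name => table
end

Base.getindex(t::ToyTable, uid::AbstractString) = t.records[uid]
Base.getindex(b::ToyBase, table::AbstractString) = b.tables[table]

# The 2-argument form simply chains the two 1-argument lookups.
Base.getindex(b::ToyBase, table::AbstractString, uid::AbstractString) = b[table][uid]

toy = ToyBase(Dict("Projects" => ToyTable(Dict("khula" => "placeholder record"))))

chained = toy["Projects"]["khula"]
direct  = toy["Projects", "khula"]
```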


VKCComputing.vkcairtable — Function

vkcairtable(name::String)

Returns a VKCAirtable type based on the table name. Requires that the local preference airtable_dir is set. See VKCComputing.set_preferences!.


VKCComputing.localairtable — Function

localairtable(tab::VKCAirtable; update=Month(1))

Create an instance of LocalAirtable, optionally updating the local copy from remote.


VKCComputing.uids — Function

uids(tab::LocalAirtable)

Get the keys for the uid column of table tab.


uids(base::LocalBase, tab::String)

Get the keys for the uid column of table tab from base.


Interacting with records

VKCComputing.resolve_links — Function

resolve_links(base::LocalBase, col; strict = true, unpack = r-> r isa AbstractString ? identity : first)

Resolves a vector of record hashes (or a vector of vectors of record hashes) into the linked records.

If the strict kwarg is true, it is expected that col is composed of either

a record hash
a one-element Vector containing a record hash

If strict is false, it is recommended to pass a custom function to unpack, which will be called on each element of col.

Eg.

julia> base = LocalBase();

julia> visits = [rec[:visit] for rec in base["Biospecimens"][["FG00004", "FG00006", "FG00008"]]]
3-element Vector{JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}:
 ["recEnxbSPMNZaoySF"]
 ["recT1EUtiUSZaypxl"]
 ["recyHSMZp0HLErHLz"]

julia> resolve_links(base, visits)
3-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("recEnxbSPMNZaoySF", AirTable("Visits"), (uid = "mc03", Biospecimens = ["recdO7nHQI7VY5ynn", #...
 Airtable.AirRecord("recT1EUtiUSZaypxl", AirTable("Visits"), (uid = "ec02", Biospecimens = ["recmuwWA1bkhpxQ4P", #...
 Airtable.AirRecord("recyHSMZp0HLErHLz", AirTable("Visits"), (uid = "mc05", Biospecimens = ["recOlXNl7OMQH6cpF", #...

julia> seqpreps = [rec[:seqprep] for rec in base["Biospecimens"][["FG00004", "FG00006", "FG00008"]]]
3-element Vector{JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}:
 ["rec33GrUTnfeNTCXe", "recBh1xD1xOw4qkhO"]
 ["recq5fj9BQb7vugUd"]
 ["recbNNM1qWXOLhnye"]

Notice that the first record here has 2 entries, so strict=true will fail.

julia> resolve_links(base, seqpreps)
ERROR: ArgumentError: At least one record has multiple entries. Use `strict = false` and `unpack` to handle this.
Stacktrace:
#...

If you just pass strict = false, the default unpack function will simply take the first record:

julia> resolve_links(base, seqpreps; strict = false)
3-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
 Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
 Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...

If you wish to keep all records, use Iterators.flatten(), or pass a custom unpack function:

julia> resolve_links(base, Iterators.flatten(seqpreps); strict = false)
4-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
 Airtable.AirRecord("recBh1xD1xOw4qkhO", AirTable("SequencingPrep"), (uid = "SEQ02505", biospecimen = ["recDcm98dkmNP3Zic"] #...
 Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
 Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...

julia> resolve_links(base, seqpreps; strict = false, unpack = identity)
3-element Vector{Vector{Airtable.AirRecord}}:
 [Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
  Airtable.AirRecord("recBh1xD1xOw4qkhO", AirTable("SequencingPrep"), (uid = "SEQ02505", biospecimen = ["recDcm98dkmNP3Zic"] #...
 ]
 [Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
 [Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...
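The interaction between strict and unpack shown in these examples can be mimicked without a real base. The sketch below is a hypothetical re-implementation for illustration only, not the package's code: toy_resolve, the made-up hashes, and the placeholder strings (standing in for whatever the real function returns) are all assumptions.

```julia
# Toy stand-in for a base: record hash => placeholder for the linked record.
toy_lookup = Dict("recAAA" => "SEQ001", "recBBB" => "SEQ002", "recCCC" => "SEQ003")

# Hypothetical re-implementation of the strict / unpack behaviour described above.
function toy_resolve(lookup, col; strict = true, unpack = first)
    map(col) do entry
        # A bare hash resolves directly.
        entry isa AbstractString && return lookup[entry]
        if strict
            # In strict mode only one-element vectors are accepted.
            length(entry) == 1 || throw(ArgumentError(
                "At least one record has multiple entries. Use `strict = false` and `unpack` to handle this."))
            return lookup[first(entry)]
        end
        # Otherwise let `unpack` decide what to keep from the entry.
        picked = unpack(entry)
        return picked isa AbstractString ? lookup[picked] : [lookup[h] for h in picked]
    end
end

col = [["recAAA", "recBBB"], ["recCCC"]]

first_only = toy_resolve(toy_lookup, col; strict = false)                    # first entry per element
everything = toy_resolve(toy_lookup, col; strict = false, unpack = identity) # keeps the nesting
flattened  = toy_resolve(toy_lookup, collect(Iterators.flatten(col)))        # one result per hash
```

Flattening before resolving loses the grouping by element, while unpack = identity keeps it, so the choice depends on whether the relationship back to the source rows still matters.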


VKCComputing.biospecimens — Function

biospecimens([base::LocalBase, ]project; strict=true)

Get all records from the table Biospecimens belonging to project.

NOTE: strict is set to true by default, and will exclude any records where keep != 1.


VKCComputing.seqpreps — Function

seqpreps([base::LocalBase, ]project; strict=true)

Get all records from the table SequencingPrep belonging to project.

NOTE: strict is set to true by default, and will exclude any records where keep != 1.


VKCComputing.subjects — Function

subjects([base::LocalBase, ]project; strict=true)

Get all records from the table Subjects belonging to project.

NOTE: strict is set to true by default, and will exclude any records where keep != 1.


Interacting with files

VKCComputing.get_analysis_files — Function

get_analysis_files(dir = @load_preference("mgx_analysis_dir"))

Expects the preference mgx_analysis_dir to be set - see set_default_preferences!.

Creates a DataFrame with the following headers:

mod: DateTime that the file was last modified
size: (Int) in bytes
path: full remote path (eg /grace/sequencing/processed/mgx/metaphlan/SEQ9999_S42_profile.tsv)
dir: Remote directory for file (eg /grace/sequencing/processed/mgx/metaphlan/), equivalent to dirname(path)
file: Remote file name (eg SEQ9999_S42_profile.tsv)
seqprep: For files that match SEQ\d+_S\d+_.+, the sequencing Prep ID (eg SEQ9999). Otherwise, missing.
S_well: For files that match SEQ\d+_S\d+_.+, the well ID, including S (eg S42). Otherwise, missing.
suffix: For files that match SEQ\d+_S\d+_.+, the remainder of the file name, aside from a leading _ (eg profile.tsv). Otherwise, missing.
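The seqprep, S_well, and suffix columns can be derived from a file name with a single regular expression. The sketch below adds capture groups and anchors to the pattern quoted above; the helper name split_seqfile is hypothetical, and anchoring the pattern to the whole file name is an assumption.

```julia
# Pattern from the column descriptions above, with capture groups and anchors added.
const SEQFILE_PATTERN = r"^(SEQ\d+)_(S\d+)_(.+)$"

# Hypothetical helper: split a file name into the three derived columns.
function split_seqfile(file::AbstractString)
    m = match(SEQFILE_PATTERN, file)
    # Files that do not follow the naming scheme get `missing` in every column.
    m === nothing && return (seqprep = missing, S_well = missing, suffix = missing)
    return (seqprep = String(m[1]), S_well = String(m[2]), suffix = String(m[3]))
end

hit  = split_seqfile("SEQ9999_S42_profile.tsv")
miss = split_seqfile("README.md")
```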

Interacting with AWS

VKCComputing.aws_ls — Function

aws_ls(path="s3://vkc-sequencing/processed/mgx/")

Get a (recursive) listing of files / directories contained at path, and return a DataFrame with the following headers:

mod: DateTime that the file was last modified
size: (Int) in bytes
path: full remote path (eg s3://bucket-name/some/SEQ9999_S42_profile.tsv)
dir: Remote directory for file (eg s3://bucket-name/some/), equivalent to dirname(path)
file: Remote file name (eg SEQ9999_S42_profile.tsv)
seqprep: For files that match SEQ\d+_S\d+_.+, the sequencing Prep ID (eg SEQ9999). Otherwise, missing.
S_well: For files that match SEQ\d+_S\d+_.+, the well ID, including S (eg S42). Otherwise, missing.
suffix: For files that match SEQ\d+_S\d+_.+, the remainder of the file name, aside from a leading _ (eg profile.tsv). Otherwise, missing.
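The relationship between the path, dir, and file columns can be sketched by splitting at the last slash. The helper split_remote_path is hypothetical and not part of VKCComputing. Julia's dirname normally omits the trailing slash, so the sketch appends one to match the dir example above.

```julia
# Hypothetical helper: derive the `dir` and `file` columns from a full remote path.
function split_remote_path(path::AbstractString)
    parts = rsplit(path, "/"; limit = 2)   # split once, starting from the right
    length(parts) == 2 || return (dir = "", file = String(path))
    return (dir = string(parts[1], "/"), file = String(parts[2]))
end

cols = split_remote_path("s3://bucket-name/some/SEQ9999_S42_profile.tsv")
```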
