VKCComputing.jl

Documentation for VKCComputing.

Setup environment

VKCComputing.set_airtable_dir! - Function
set_airtable_dir!(key)

Sets the local preference airtable_dir to key (defaults to the environment variable "AIRTABLE_DIR" if set).

VKCComputing.set_readonly_pat! - Function
set_readonly_pat!(key)

Sets the local preference readonly_pat to key (defaults to the environment variable "AIRTABLE_KEY" if set).

VKCComputing.set_readwrite_pat! - Function
set_readwrite_pat!(key)

Sets the local preference readwrite_pat to key (defaults to the environment variable "AIRTABLE_RW_KEY" if set).
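
The three setters above share one pattern: use the explicit key if given, otherwise fall back to an environment variable. A minimal sketch of that fallback follows. The helper name key_or_env is hypothetical, and throwing when neither source is available is an assumption, since these docs do not say what the package does in that case.

```julia
# Hypothetical helper illustrating the fallback; not the package's implementation.
# `key` defaults to the value of the environment variable, or `nothing` if it is unset.
function key_or_env(envvar::AbstractString, key = get(ENV, envvar, nothing))
    # Assumption: fail loudly when there is neither an argument nor an environment variable.
    isnothing(key) && throw(ArgumentError("no key given and ENV[\"$envvar\"] is not set"))
    return key
end
```

An explicit argument always wins, eg key_or_env("AIRTABLE_DIR", "/data/airtable") ignores the environment entirely.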


Interacting with Airtable

VKCComputing.LocalBase - Type
LocalBase(; update=Month(1))

Load the Airtable sample database into memory. Requires that an Airtable key with read access and a directory for storing local files are set in preferences (see set_readonly_pat! and set_airtable_dir!).

Updating local base files

The update keyword argument can take a number of different forms.

  1. A boolean value, which will cause updates to all tables if true, and no tables if false.
  2. An AbstractTime from Dates (eg Week(1)), which will update any table whose local copy was last updated longer ago than this value.
  3. A vector of Pairs of the form "$table_name"=> x, where x is either of the options from (1) or (2) above.

For example, to update the "Biospecimens" table if it's older than a week, and to update the "Projects" table no matter what, call

julia> using VKCComputing, Dates

julia> base = LocalBase(; update=["Biospecimens"=> Week(1), "Projects"=> true]);
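
The three forms of update can be read as a per-table rule. The sketch below illustrates that reading with standalone code; needs_update and tables_to_update are hypothetical names, not part of VKCComputing, and the choice to leave tables that are not listed in the vector form untouched is an assumption.

```julia
using Dates

# Hypothetical helpers (not part of VKCComputing) that mirror the rules described above.
# Form (1): a Bool forces or suppresses the update.
needs_update(rule::Bool, last_updated::DateTime, at::DateTime) = rule
# Form (2): a Period triggers an update when the local copy is older than that period.
needs_update(rule::Period, last_updated::DateTime, at::DateTime) = last_updated + rule < at

# `last_updated` maps table names to the time their local copy was last refreshed.
function tables_to_update(update, last_updated::Dict{String, DateTime}; at = Dates.now())
    # Form (3): a vector of pairs gives per-table rules.
    # Assumption: tables that are not listed are left alone.
    rules = update isa AbstractVector ? Dict(update) : Dict(t => update for t in keys(last_updated))
    return sort(String[t for (t, rule) in rules if needs_update(rule, last_updated[t], at)])
end
```

With a "Biospecimens" copy that is nine days old, the rule "Biospecimens" => Week(1) selects it, while "Projects" => true selects "Projects" regardless of age.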

Indexing

Indexing into the local base can be done either with the name of a table (eg base["Biospecimens"]), which will return a VKCAirtable, or using a record ID hash (eg base["recUqEcu3pM8p2jzQ"]).

Warning

Note that record ID hashes are identified based on the regular expression r"^rec[A-Za-z0-9]{14}$" - that is, a string starting with "rec", followed by exactly 14 alphanumeric characters. In principle, one could name a table as something that matches this regular expression, causing it to be improperly identified as a record hash rather than a table name.

VKCAirtables can also be indexed with the uid column string, so an individual record can be accessed using eg base["Projects"]["khula"], but a 2-argument indexing option is provided for convenience, eg base["Projects", "khula"].
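
The indexing rules above can be sketched with a toy type. ToyBase, its fields, and the sample data are all hypothetical stand-ins (the internals of LocalBase are not shown in these docs); only the regular expression is taken from the warning above.

```julia
# Toy stand-in for LocalBase; the real type and its fields are not shown in these docs.
struct ToyBase
    tables::Dict{String, Dict{String, String}}   # table name => (uid => record hash)
    records::Dict{String, String}                # record hash => uid
end

# Pattern quoted in the warning above.
const RECORD_HASH = r"^rec[A-Za-z0-9]{14}$"

# One-argument indexing: treat the key as a record hash if it matches, else as a table name.
Base.getindex(b::ToyBase, key::AbstractString) =
    occursin(RECORD_HASH, key) ? b.records[key] : b.tables[key]

# Two-argument convenience form, equivalent to b[tab][uid].
Base.getindex(b::ToyBase, tab::AbstractString, uid::AbstractString) = b[tab][uid]

# Made-up sample data for illustration only.
toy = ToyBase(
    Dict("Projects" => Dict("khula" => "recUqEcu3pM8p2jzQ")),
    Dict("recUqEcu3pM8p2jzQ" => "khula"),
)
```

Here toy["Projects", "khula"] and toy["Projects"]["khula"] return the same value, and a table named like "recAAAAAAAAAAAAAA" would be unreachable by name, which is the pitfall the warning describes.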

VKCComputing.uids - Function
uids(tab::LocalAirtable)

Get the keys for the uid column of table tab.

uids(base::LocalBase, tab::String)

Get the keys for the uid column of table tab from base.


Interacting with records

VKCComputing.resolve_links - Function
resolve_links(base::LocalBase, col; strict = true, unpack = r-> r isa AbstractString ? identity : first)

Resolves a vector of record hashes (or a vector of vectors of record hashes) into the uids of the linked record.

If the strict kwarg is true, each element of col is expected to be either

  1. a record hash
  2. a one-element Vector containing a record hash

If strict is false, it is recommended to pass a custom function to unpack, which will be called on each row of the col.

Eg.

julia> base = LocalBase();

julia> visits = [rec[:visit] for rec in base["Biospecimens"][["FG00004", "FG00006", "FG00008"]]]
3-element Vector{JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}:
 ["recEnxbSPMNZaoySF"]
 ["recT1EUtiUSZaypxl"]
 ["recyHSMZp0HLErHLz"]

julia> resolve_links(base, visits)
3-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("recEnxbSPMNZaoySF", AirTable("Visits"), (uid = "mc03", Biospecimens = ["recdO7nHQI7VY5ynn", #...
 Airtable.AirRecord("recT1EUtiUSZaypxl", AirTable("Visits"), (uid = "ec02", Biospecimens = ["recmuwWA1bkhpxQ4P", #...
 Airtable.AirRecord("recyHSMZp0HLErHLz", AirTable("Visits"), (uid = "mc05", Biospecimens = ["recOlXNl7OMQH6cpF", #...

julia> seqpreps = [rec[:seqprep] for rec in base["Biospecimens"][["FG00004", "FG00006", "FG00008"]]]
3-element Vector{JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}:
 ["rec33GrUTnfeNTCXe", "recBh1xD1xOw4qkhO"]
 ["recq5fj9BQb7vugUd"]
 ["recbNNM1qWXOLhnye"]

Notice that the first record here has 2 entries, so strict=true will fail.

julia> resolve_links(base, seqpreps)
ERROR: ArgumentError: At least one record has multiple entries. Use `strict = false` and `unpack` to handle this.
Stacktrace:
#...

If you just pass strict = false, the default unpack function will simply take the first record:

julia> resolve_links(base, seqpreps; strict = false)
3-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
 Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
 Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...

If you wish to keep all records, use Iterators.flatten(), or pass a custom unpack function:

julia> resolve_links(base, Iterators.flatten(seqpreps); strict = false)
4-element Vector{Airtable.AirRecord}:
 Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
 Airtable.AirRecord("recBh1xD1xOw4qkhO", AirTable("SequencingPrep"), (uid = "SEQ02505", biospecimen = ["recDcm98dkmNP3Zic"] #...
 Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
 Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...

julia> resolve_links(base, seqpreps; strict = false, unpack = identity)
3-element Vector{Vector{Airtable.AirRecord}}:
 [Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
  Airtable.AirRecord("recBh1xD1xOw4qkhO", AirTable("SequencingPrep"), (uid = "SEQ02505", biospecimen = ["recDcm98dkmNP3Zic"] #...
 ]
 [Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
 [Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...
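
The interplay of strict and unpack shown above can be reproduced on plain strings, without Airtable access. The sketch below is a hypothetical stand-in (unpack_hashes is not a package function) that only selects record hashes and does not look up records; its default unpack is written to mimic the "take the first entry" behaviour described above, which is an assumption about the intent rather than a copy of the package's code.

```julia
# Hypothetical stand-in for the strict / unpack handling; not the package's implementation.
function unpack_hashes(col; strict = true, unpack = r -> r isa AbstractString ? r : first(r))
    if strict
        # Every row must be a bare hash or a one-element collection of hashes.
        all(r -> r isa AbstractString || length(r) == 1, col) ||
            throw(ArgumentError("At least one record has multiple entries. Use `strict = false` and `unpack` to handle this."))
    end
    # `unpack` is applied to each row of `col`.
    return [unpack(r) for r in col]
end

# Same shape as the seqpreps column above: the first row has two entries.
hashes = [["rec33GrUTnfeNTCXe", "recBh1xD1xOw4qkhO"], ["recq5fj9BQb7vugUd"], ["recbNNM1qWXOLhnye"]]

first_only = unpack_hashes(hashes; strict = false)                     # one hash per row
all_flat   = unpack_hashes(Iterators.flatten(hashes); strict = false)  # every hash, flattened
grouped    = unpack_hashes(hashes; strict = false, unpack = identity)  # rows kept as they are
```

Calling unpack_hashes(hashes) with the default strict = true throws the ArgumentError, matching the failure shown earlier.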
VKCComputing.biospecimens - Function
biospecimens([base::LocalBase, ]project; strict=true)

Get all records from the table Biospecimens belonging to project.

NOTE: strict is set to true by default, which excludes any records where keep != 1.

VKCComputing.seqpreps - Function
seqpreps([base::LocalBase, ]project; strict=true)

Get all records from the table SequencingPrep belonging to project.

NOTE: strict is set to true by default, which excludes any records where keep != 1.

VKCComputing.subjects - Function
subjects([base::LocalBase, ]project; strict=true)

Get all records from the table Subjects belonging to project.

NOTE: strict is set to true by default, which excludes any records where keep != 1.


Interacting with files

VKCComputing.get_analysis_files - Function
get_analysis_files(dir = @load_preference("mgx_analysis_dir"))

Expects the preference mgx_analysis_dir to be set - see set_default_preferences!.

Creates a DataFrame with the following headers:

  • mod: DateTime that the file was last modified
  • size: (Int) in bytes
  • path: full remote path (eg /grace/sequencing/processed/mgx/metaphlan/SEQ9999_S42_profile.tsv)
  • dir: Remote directory for file (eg /grace/sequencing/processed/mgx/metaphlan/), equivalent to dirname(path)
  • file: Remote file name (eg SEQ9999_S42_profile.tsv)
  • seqprep: For files that match SEQ\d+_S\d+_.+, the sequencing prep ID (eg SEQ9999). Otherwise, missing.
  • S_well: For files that match SEQ\d+_S\d+_.+, the well ID, including S (eg S42). Otherwise, missing.
  • suffix: For files that match SEQ\d+_S\d+_.+, the remainder of the file name, aside from a leading _ (eg profile.tsv). Otherwise, missing.
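
The seqprep, S_well, and suffix columns can be derived from a file name with a single regular expression. The sketch below is one way to do it; parse_seq_filename is a hypothetical helper, and the anchors and capture groups added to the SEQ\d+_S\d+_.+ pattern are assumptions about how the columns are split.

```julia
# Hypothetical helper; the anchors and capture groups are assumptions based on the pattern above.
function parse_seq_filename(file::AbstractString)
    m = match(r"^(SEQ\d+)_(S\d+)_(.+)$", file)
    # Files that do not match get `missing` in all three columns.
    isnothing(m) && return (seqprep = missing, S_well = missing, suffix = missing)
    # The underscore after the well ID sits outside the last group, so the suffix has no leading "_".
    return (seqprep = String(m[1]), S_well = String(m[2]), suffix = String(m[3]))
end

parsed = parse_seq_filename("SEQ9999_S42_profile.tsv")
```

For the example file name used in the list, parsed.seqprep is "SEQ9999", parsed.S_well is "S42", and parsed.suffix is "profile.tsv".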

See also aws_ls


Interacting with AWS

VKCComputing.aws_ls - Function
aws_ls(path="s3://vkc-sequencing/processed/mgx/")

Get a (recursive) listing of files / directories contained at path, and return a DataFrame with the following headers:

  • mod: DateTime that the file was last modified
  • size: (Int) in bytes
  • path: full remote path (eg s3://bucket-name/some/SEQ9999_S42_profile.tsv)
  • dir: Remote directory for file (eg s3://bucket-name/some/), equivalent to dirname(path)
  • file: Remote file name (eg SEQ9999_S42_profile.tsv)
  • seqprep: For files that match SEQ\d+_S\d+_.+, the sequencing prep ID (eg SEQ9999). Otherwise, missing.
  • S_well: For files that match SEQ\d+_S\d+_.+, the well ID, including S (eg S42). Otherwise, missing.
  • suffix: For files that match SEQ\d+_S\d+_.+, the remainder of the file name, aside from a leading _ (eg profile.tsv). Otherwise, missing.