VKCComputing.jl
Documentation for VKCComputing.
VKCComputing.LocalAirtable
VKCComputing.LocalBase
VKCComputing.VKCAirtable
VKCComputing.audit_analysis_files
VKCComputing.audit_tools
VKCComputing.aws_ls
VKCComputing.biospecimens
VKCComputing.get_analysis_files
VKCComputing.localairtable
VKCComputing.resolve_links
VKCComputing.seqpreps
VKCComputing.set_airtable_dir!
VKCComputing.set_default_preferences!
VKCComputing.set_readonly_pat!
VKCComputing.set_readwrite_pat!
VKCComputing.subjects
VKCComputing.uids
VKCComputing.vkcairtable
Setup environment
VKCComputing.set_default_preferences!
— Function
TODO:
- set the airtable dir using a call to the scratch drive with the user name
- throw warnings if any of the directories don't exist
VKCComputing.set_airtable_dir!
— Function
set_airtable_dir!(key)
Sets the local preference airtable_dir to key (defaults to the environment variable "AIRTABLE_DIR", if set).
VKCComputing.set_readonly_pat!
— Function
set_readonly_pat!(key)
Sets the local preference readonly_pat to key (defaults to the environment variable "AIRTABLE_KEY", if set).
VKCComputing.set_readwrite_pat!
— Function
set_readwrite_pat!(key)
Sets the local preference readwrite_pat to key (defaults to the environment variable "AIRTABLE_RW_KEY", if set).
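The three setters above share one pattern: an explicit argument wins, and the named environment variable is the fallback. A minimal sketch of that pattern in plain Julia follows. The helper value_or_env is hypothetical and not part of VKCComputing, and throwing when neither source is available is an assumption, not documented behavior.

```julia
# Hypothetical helper (not part of VKCComputing): pick the value to store,
# preferring an explicit argument over the environment variable.
function value_or_env(envvar::AbstractString, value = nothing)
    value === nothing || return value
    haskey(ENV, envvar) && return ENV[envvar]
    # Assumption: fail loudly when neither source provides a value.
    throw(ArgumentError("no value given and ENV[\"$envvar\"] is not set"))
end
```

With AIRTABLE_DIR exported in the shell, value_or_env("AIRTABLE_DIR") returns its value, while value_or_env("AIRTABLE_DIR", "/some/dir") ignores the environment.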
Interacting with Airtable
VKCComputing.VKCAirtable
— Type
VKCAirtable(base, name, localpath)
Connecting Airtable tables with local instances. Generally, use vkcairtable to create.
VKCComputing.LocalAirtable
— Type
LocalAirtable(table, data, uididx)
Primary data structure for interacting with airtable-based data.
VKCComputing.LocalBase
— Type
LocalBase(; update=Month(1))
Load the airtable sample database into memory. Requires that an airtable key with read access and a directory for storing local files are set in preferences (see set_readonly_pat! and set_airtable_dir!).
Updating local base files
The update keyword argument can take a number of different forms.
- A boolean value, which will cause updates to all tables if true, and no tables if false.
- An AbstractTime from Dates (eg Week(1)), which will update any table whose local copy was last updated longer ago than this value.
- A vector of Pairs of the form "$table_name" => x, where x is either of the first two options above.
For example, to update the "Biospecimens" table if it's older than a week, and to update the "Projects" table no matter what, call
julia> using VKCComputing, Dates
julia> base = LocalBase(; update=["Biospecimens"=> Week(1), "Projects"=> true]);
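How such an update value could be interpreted can be sketched with multiple dispatch. This is only an illustration: should_update and table_policy are hypothetical helpers, not VKCComputing functions, and the fallback for tables missing from the vector of pairs is an assumption.

```julia
using Dates

# Hypothetical helpers, not part of VKCComputing.

# Boolean and period forms: a Bool forces or suppresses the update; a period
# triggers it once the local copy is older than that period.
should_update(policy::Bool, last_updated::DateTime; at::DateTime = now()) = policy
should_update(policy::Period, last_updated::DateTime; at::DateTime = now()) =
    last_updated + policy < at

# Vector-of-pairs form: look up the policy for one table.
# Assumption: tables that are not listed fall back to `default`.
function table_policy(update::AbstractVector, table::AbstractString; default = Month(1))
    for (name, policy) in update
        name == table && return policy
    end
    return default
end
# Any other form applies to every table unchanged.
table_policy(update, table::AbstractString; default = Month(1)) = update
```

Comparing last_updated + policy against the current time, instead of subtracting two timestamps, keeps calendar periods such as Month(1) usable, since they have no fixed length in milliseconds.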
Indexing
Indexing into the local base can be done either with the name of a table (eg base["Biospecimens"]), which will return a VKCAirtable, or using a record ID hash (eg base["recUqEcu3pM8p2jzQ"]).
Note that record ID hashes are identified based on the regular expression r"^rec[A-Za-z0-9]{14}$" - that is, a string starting with "rec", followed by exactly 14 alphanumeric characters. In principle, one could name a table as something that matches this regular expression, causing it to be improperly identified as a record hash rather than a table name.
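The record-ID test described above can be tried directly with the quoted regular expression. The helpers is_record_id and lookup are hypothetical names used for illustration, and the two dictionaries stand in for the tables and records of a real base.

```julia
# Hypothetical helper: true when `key` would be treated as a record ID hash.
is_record_id(key::AbstractString) = occursin(r"^rec[A-Za-z0-9]{14}$", key)

# Hypothetical dispatcher: record IDs go to `records`, everything else to `tables`.
function lookup(tables::AbstractDict, records::AbstractDict, key::AbstractString)
    return is_record_id(key) ? records[key] : tables[key]
end
```

A table named "recruitmentTable1" (the letters "rec" plus 14 alphanumeric characters) would satisfy is_record_id, which is exactly the misidentification the note warns about.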
VKCAirtables can also be indexed with the uid column string, so an individual record can be accessed using eg base["Projects"]["khula"], but a 2-argument indexing option is provided for convenience, eg base["Projects", "khula"].
VKCComputing.vkcairtable
— Function
vkcairtable(name::String)
Returns a VKCAirtable type based on the table name. Requires that the local preference airtable_dir is set. See set_airtable_dir!.
VKCComputing.localairtable
— Function
localairtable(tab::VKCAirtable; update=Month(1))
Create an instance of LocalAirtable, optionally updating the local copy from remote.
VKCComputing.uids
— Function
uids(tab::LocalAirtable)
Get the keys for the uid column of table tab.
uids(base::LocalBase, tab::String)
Get the keys for the uid column of table tab from base.
Interacting with records
VKCComputing.resolve_links
— Function
resolve_links(base::LocalBase, col; strict = true, unpack = r-> r isa AbstractString ? identity : first)
Resolves a vector of record hashes (or a vector of vectors of record hashes) into the uids of the linked records.
If the strict kwarg is true, it is expected that each element of col is either
- a record hash
- a one-element Vector containing a record hash
If strict is false, it is recommended to pass a custom function to unpack, which will be called on each row of col.
Eg.
julia> base = LocalBase();
julia> visits = [rec[:visit] for rec in base["Biospecimens"][["FG00004", "FG00006", "FG00008"]]]
3-element Vector{JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}:
["recEnxbSPMNZaoySF"]
["recT1EUtiUSZaypxl"]
["recyHSMZp0HLErHLz"]
julia> resolve_links(base, visits)
3-element Vector{Airtable.AirRecord}:
Airtable.AirRecord("recEnxbSPMNZaoySF", AirTable("Visits"), (uid = "mc03", Biospecimens = ["recdO7nHQI7VY5ynn", #...
Airtable.AirRecord("recT1EUtiUSZaypxl", AirTable("Visits"), (uid = "ec02", Biospecimens = ["recmuwWA1bkhpxQ4P", #...
Airtable.AirRecord("recyHSMZp0HLErHLz", AirTable("Visits"), (uid = "mc05", Biospecimens = ["recOlXNl7OMQH6cpF", #...
julia> seqpreps = [rec[:seqprep] for rec in base["Biospecimens"][["FG00004", "FG00006", "FG00008"]]]
3-element Vector{JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}:
["rec33GrUTnfeNTCXe", "recBh1xD1xOw4qkhO"]
["recq5fj9BQb7vugUd"]
["recbNNM1qWXOLhnye"]
Notice that the first record here has 2 entries, so strict=true will fail.
julia> resolve_links(base, seqpreps)
ERROR: ArgumentError: At least one record has multiple entries. Use `strict = false` and `unpack` to handle this.
Stacktrace:
#...
If you just pass strict = false, the default unpack function will simply take the first record:
julia> resolve_links(base, seqpreps; strict = false)
3-element Vector{Airtable.AirRecord}:
Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...
If you wish to keep all records, use Iterators.flatten(), or pass a custom unpack function:
julia> resolve_links(base, Iterators.flatten(seqpreps); strict = false)
4-element Vector{Airtable.AirRecord}:
Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
Airtable.AirRecord("recBh1xD1xOw4qkhO", AirTable("SequencingPrep"), (uid = "SEQ02505", biospecimen = ["recDcm98dkmNP3Zic"] #...
Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...
julia> resolve_links(base, seqpreps; strict = false, unpack = identity)
3-element Vector{Vector{Airtable.AirRecord}}:
[Airtable.AirRecord("rec33GrUTnfeNTCXe", AirTable("SequencingPrep"), (uid = "SEQ01071", biospecimen = ["recDcm98dkmNP3Zic"] #...
Airtable.AirRecord("recBh1xD1xOw4qkhO", AirTable("SequencingPrep"), (uid = "SEQ02505", biospecimen = ["recDcm98dkmNP3Zic"] #...
]
[Airtable.AirRecord("recq5fj9BQb7vugUd", AirTable("SequencingPrep"), (uid = "SEQ00729", biospecimen = ["recL6D53j76R0eRp5"] #...
[Airtable.AirRecord("recbNNM1qWXOLhnye", AirTable("SequencingPrep"), (uid = "SEQ01960", biospecimen = ["rech7m4F33iGWtgOU"] #...
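The interplay of strict and unpack shown in these examples can be reproduced on toy data without an Airtable connection. The sketch below is a hypothetical re-implementation for illustration, not the package's code: resolve and lookup are made-up names, the hashes and uids are invented, and it returns plain uid strings where the real function returns records.

```julia
# Toy stand-in for a base: record hash => uid. All values here are made up.
lookup = Dict("recA" => "SEQ0001", "recB" => "SEQ0002", "recC" => "SEQ0003")

# Hypothetical sketch of the strict/unpack logic, not the package's implementation.
function resolve(lookup, col; strict = true, unpack = first)
    map(col) do cell
        if cell isa AbstractString
            lookup[cell]                  # a bare record hash
        elseif strict
            length(cell) == 1 || throw(ArgumentError(
                "At least one record has multiple entries. Use `strict = false` and `unpack` to handle this."))
            lookup[only(cell)]            # a one-element vector
        else
            picked = unpack(cell)         # the caller decides what to keep
            picked isa AbstractString ? lookup[picked] : [lookup[h] for h in picked]
        end
    end
end

col = [["recA", "recB"], ["recC"]]
```

Here resolve(lookup, col) throws because the first cell has two entries, resolve(lookup, col; strict = false) keeps only the first entry of each cell, and unpack = identity keeps the nesting.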
VKCComputing.biospecimens
— Function
biospecimens([base::LocalBase, ]project; strict=true)
Get all records from the table Biospecimens belonging to project.
NOTE: strict is set to true by default, which excludes any records where keep != 1.
VKCComputing.seqpreps
— Function
seqpreps([base::LocalBase, ]project; strict=true)
Get all records from the table SequencingPrep belonging to project.
NOTE: strict is set to true by default, which excludes any records where keep != 1.
VKCComputing.subjects
— Function
subjects([base::LocalBase, ]project; strict=true)
Get all records from the table Subjects belonging to project.
NOTE: strict is set to true by default, which excludes any records where keep != 1.
Interacting with files
VKCComputing.get_analysis_files
— Function
get_analysis_files(dir = @load_preference("mgx_analysis_dir"))
Expects the preference mgx_analysis_dir to be set - see set_default_preferences!.
Creates DataFrame with the following headers:
- mod: DateTime that the file was last modified
- size: (Int) in bytes
- path: full remote path (eg /grace/sequencing/processed/mgx/metaphlan/SEQ9999_S42_profile.tsv)
- dir: Remote directory for file (eg /grace/sequencing/processed/mgx/metaphlan/), equivalent to dirname(path)
- file: Remote file name (eg SEQ9999_S42_profile.tsv)
- seqprep: For files that match SEQ\d+_S\d+_.+, the sequencing Prep ID (eg SEQ9999). Otherwise, missing.
- S_well: For files that match SEQ\d+_S\d+_.+, the well ID, including S (eg S42). Otherwise, missing.
- suffix: For files that match SEQ\d+_S\d+_.+, the remainder of the file name, aside from a leading _ (eg profile.tsv). Otherwise, missing.
See also aws_ls
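The seqprep, S_well and suffix columns follow from a single regular-expression match on the file name. A minimal sketch, assuming the documented pattern is anchored to the whole name and using a hypothetical helper called parse_seq_filename:

```julia
# Hypothetical helper: split a file name into the seqprep, S_well and suffix
# columns, or return `missing` for all three when the name does not match.
function parse_seq_filename(file::AbstractString)
    m = match(r"^(SEQ\d+)_(S\d+)_(.+)$", file)
    m === nothing && return (seqprep = missing, S_well = missing, suffix = missing)
    return (seqprep = String(m[1]), S_well = String(m[2]), suffix = String(m[3]))
end
```

For the example name used above, parse_seq_filename("SEQ9999_S42_profile.tsv") yields seqprep "SEQ9999", S_well "S42" and suffix "profile.tsv".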
VKCComputing.audit_analysis_files
— Function
audit_analysis_files(analysis_files; base = LocalBase())
WIP
VKCComputing.audit_tools
— Function
audit_tools(df::DataFrame; group_col="seqprep")
WIP
Interacting with AWS
VKCComputing.aws_ls
— Function
aws_ls(path="s3://vkc-sequencing/processed/mgx/")
Get a (recursive) listing of files / directories contained at path, and return a DataFrame with the following headers:
- mod: DateTime that the file was last modified
- size: (Int) in bytes
- path: full remote path (eg s3://bucket-name/some/SEQ9999_S42_profile.tsv)
- dir: Remote directory for file (eg s3://bucket-name/some/), equivalent to dirname(path)
- file: Remote file name (eg SEQ9999_S42_profile.tsv)
- seqprep: For files that match SEQ\d+_S\d+_.+, the sequencing Prep ID (eg SEQ9999). Otherwise, missing.
- S_well: For files that match SEQ\d+_S\d+_.+, the well ID, including S (eg S42). Otherwise, missing.
- suffix: For files that match SEQ\d+_S\d+_.+, the remainder of the file name, aside from a leading _ (eg profile.tsv). Otherwise, missing.
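The mod and size columns resemble what the AWS CLI prints for aws s3 ls --recursive, where each line has the shape "<date> <time> <size> <key>". Whether aws_ls obtains its listing this way is an assumption. The sketch below only shows how one such line could be parsed; parse_ls_line is a hypothetical helper and the sample line in the usage note is made up.

```julia
using Dates

# Hypothetical helper: parse one "<date> <time> <size> <key>" listing line.
# Returns `nothing` for lines that do not have that shape (eg "PRE some/dir/").
function parse_ls_line(line::AbstractString)
    m = match(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+(.+)$", line)
    m === nothing && return nothing
    return (mod  = DateTime(m[1], dateformat"yyyy-mm-dd HH:MM:SS"),
            size = parse(Int, m[2]),
            key  = String(m[3]))
end
```

For example, parse_ls_line("2023-05-01 12:34:56       1024 processed/mgx/metaphlan/SEQ9999_S42_profile.tsv") gives a DateTime, the integer 1024, and the object key, which could then be prefixed with the bucket to form the path column.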