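Retrieving SRA data with SRA Toolkit
To retrieve SRA data from NCBI, use SRA Toolkit. For how to install SRA Toolkit, see here.
Checking the Accession ID
First, search the NCBI SRA database for the SRA data you want and note its Accession ID. If you want to retrieve multiple datasets, download the Accession List beforehand.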
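As a rough illustration, an Accession List exported from the SRA Run Selector is a plain text file with one run accession per line (prefix SRR, ERR, or DRR followed by digits). The sketch below filters such a list down to well-formed accessions before it is passed on; the function name `valid_accessions` and the sample list are made up for this example and are not part of SRA Toolkit.

```shell
# Hypothetical helper (not part of SRA Toolkit): print only the lines of an
# accession list that look like run accessions, e.g. SRR..., ERR..., DRR...
valid_accessions() {
    # strip Windows line endings, then keep lines made of prefix + digits only
    tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$'
}

# Demo on a throwaway list containing one malformed line
list=$(mktemp)
printf 'DRR084187\nDRR028826\nnot-an-accession\n' > "$list"
valid_accessions "$list"
rm -f "$list"
```

Lines that do not match are silently dropped, so comparing the line counts before and after is a quick way to spot a damaged list.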
Retrieving the SRA file
# Download the sra file for DRR084187 into the current directory
prefetch --output-directory ./ DRR084187
# Download everything listed in an Accession List
prefetch --output-directory ./ --option-file SRR_Acc_List.txt
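After a batch download it can be useful to check which accessions are still missing. The following is only a sketch: the helper name `missing_accessions` is hypothetical, and it assumes each download is stored either as `<accession>.sra` directly in the output directory (as in the commands above) or in a per-accession subdirectory, which is an assumption about how other toolkit releases lay out files.

```shell
# Hypothetical helper: list accessions from a list file that have no
# downloaded .sra file yet in the given directory (default: current dir).
missing_accessions() {
    list=$1
    dir=${2:-.}
    tr -d '\r' < "$list" | while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue          # skip blank lines
        # Assumed layouts: DIR/ACC.sra or DIR/ACC/ACC.sra
        if [ ! -e "$dir/$acc.sra" ] && [ ! -e "$dir/$acc/$acc.sra" ]; then
            echo "$acc"
        fi
    done
}

# Usage (commented out; needs a real list file):
# missing_accessions SRR_Acc_List.txt ./
```

An empty result means every accession in the list has a corresponding file under one of the two assumed layouts; it does not verify that the files are complete.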
Converting to fastq files
# Convert a single-end sra file to a fastq file and compress it with gzip
fastq-dump --gzip --defline-seq '@$sn[_$rn]/$ri' DRR084187.sra
# Convert a paired-end sra file to fastq files and compress them with bzip2
fastq-dump --bzip2 --split-files --defline-seq '@$sn[_$rn]/$ri' DRR028826.sra
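Here, --defline-seq '@$sn[_$rn]/$ri' sets the read ID format in the fastq file to the Illumina style, in which the read number is appended as /1 or /2 (some software, such as Trinity, requires Illumina-style IDs). Also, by default the third line of each fastq record repeats the ID after the "+", but in most cases a bare "+" is enough, so specifying --defline-qual '+' reduces the file size.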
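To make the effect of --defline-qual '+' concrete, the sketch below applies the same change to a fastq stream after the fact with awk. The sample record and the function name `strip_qual_defline` are invented for illustration, and the sketch assumes unwrapped four-line records.

```shell
# Hypothetical helper: replace the third line of every 4-line fastq record
# with a bare "+", mimicking the third line that --defline-qual '+' yields.
strip_qual_defline() {
    awk 'NR % 4 == 3 { print "+"; next } { print }'
}

# Made-up record whose third line repeats the read ID
printf '@read1/1\nACGTACGT\n+read1/1\nIIIIIIII\n' | strip_qual_defline
```

For already converted, compressed files the same function could sit between a decompressor and a compressor in a pipeline, although passing the option to fastq-dump at conversion time avoids the extra pass.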
To convert multiple files in one go, loop over them with a for statement as follows.
# Convert every sra file in the current directory to fastq
for file in ./*.sra
do
    fastq-dump --bzip2 --split-files --defline-qual '+' --defline-seq '@$sn[_$rn]/$ri' "$file"
done
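When the loop is rerun after an interruption, it converts every file again. A possible variation is sketched below: it skips inputs whose first output file already exists and offers a dry-run mode that only prints the commands. The function name `convert_all` and the `DRY_RUN` variable are hypothetical, and the sketch assumes that --split-files together with --bzip2 writes `<name>_1.fastq.bz2` into the directory given to --outdir.

```shell
# Hypothetical wrapper around the loop above.
# Set DRY_RUN=1 to print the fastq-dump commands instead of running them.
convert_all() {
    dir=${1:-.}
    for file in "$dir"/*.sra; do
        [ -e "$file" ] || continue            # glob matched nothing
        base=$(basename "$file" .sra)
        # Assumed output name for the first read file
        if [ -e "$dir/${base}_1.fastq.bz2" ]; then
            echo "skip $base (already converted)" >&2
            continue
        fi
        ${DRY_RUN:+echo} fastq-dump --bzip2 --split-files --outdir "$dir" \
            --defline-qual '+' --defline-seq '@$sn[_$rn]/$ri' "$file"
    done
}

# Preview what would be run for the current directory
DRY_RUN=1 convert_all .
```

The existence check cannot tell a finished output from a partially written one, so files left behind by an interrupted run should be deleted before rerunning.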
Usage of each command
The usage of the prefetch command is as follows.
$ prefetch --help
Usage:
prefetch [options] <SRA accession | kart file> [...]
Download SRA or dbGaP files and their dependencies
prefetch [options] <SRA file> [...]
Check SRA file for missed dependencies and download them
prefetch --list <kart file> [...]
List content of kart file
Options:
-T|--type Specify file type to download. Default: sra
-t|--transport <value> Transport: one of: fasp; http; both. (fasp
only; http only; first try fasp (ascp), use
http if cannot download using fasp).
Default: both
-N|--min-size <size> Minimum file size to download in KB
(inclusive).
-X|--max-size <size> Maximum file size to download in KB
(exclusive). Default: 20G
-f|--force <value> Force object download one of: no, yes,
all. no [default]: skip download if the
object if found and complete; yes: download
it even if it is found and is complete; all:
ignore lock files (stale locks or it is
being downloaded by another process: use at
your own risk!)
-p|--progress <value> Time period in minutes to display download
progress (0: no progress), default: 1
--eliminate-quals Don't download QUALITY column
-c|--check-all Double-check all refseqs
-l|--list List the content of kart file
-n|--numbered-list List the content of kart file with kart
row numbers
-s|--list-sizes List the content of kart file with target
file sizes
-R|--rows <rows> Kart rows to download (default all). row
list should be ordered
-o|--order <value> Kart prefetch order when downloading
kart: one of: kart, size. (in kart order, by
file size: smallest first), default: size
-a|--ascp-path <ascp-binary|private-key-file> Path to ascp program and
private key file (asperaweb_id_dsa.putty)
--ascp-options <value> Arbitrary options to pass to ascp command
line
-o|--output-file <FILE> Write file to FILE when downloading
single file
-O|--output-directory <DIRECTORY> Save files to DIRECTORY/
-h|--help Output brief explanation for the program.
-V|--version Display the version of the program then
quit.
-L|--log-level <level> Logging level as number or enum string. One
of (fatal|sys|int|err|warn|info|debug) or
(0-6) Current/default is warn
-v|--verbose Increase the verbosity of the program
status messages. Use multiple times for more
verbosity. Negates quiet.
-q|--quiet Turn off all status messages for the
program. Negated by verbose.
--option-file <file> Read more options and parameters from the
file.
prefetch : 2.9.6
The usage of fastq-dump is as follows.
$ fastq-dump --help
Usage:
fastq-dump [options] <path> [<path>...]
fastq-dump [options] <accession>
INPUT
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
table dump)
--table <table-name> Table name within cSRA object, default is
"SEQUENCE"
PROCESSING
Read Splitting Sequence data may be used in raw form or
split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters Applied to the full spot independently
of --split-spot
-N|--minSpotId <rowid> Minimum spot id
-X|--maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W|--clip Remove adapter sequences from reads
Common Filters Applied to spots when --split-spot is not
set, otherwise - to individual reads
-M|--minReadLen <len> Filter by sequence length >= <len>
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-E|--qual-filter Filter used in early 1000 Genomes data: no
sequences starting or ending with >= 10N
--qual-filter-1 Filter used in current 1000 Genomes data
Filters based on alignments Filters are active when alignment
data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can
either be accession.version (ex:
NC_000001.10) or file specific name (ex:
"chr1" or "1"). "from" and "to" are 1-based
coordinates
--matepair-distance <from-to|unknown> Filter by distance between matepairs.
Use "unknown" to find matepairs split
between the references. Use from-to to limit
matepair distance on the same reference
Filters for individual reads Applied only with --split-spot set
--skip-technical Dump only biological reads
OUTPUT
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip
--bzip2 Compress output using bzip2
Multiple File Options Setting these options will produce more
than 1 file, each of which will be suffixed
according to splitting criteria.
--split-files Dump each read into separate file.Files
will receive suffix corresponding to read
number
--split-3 Legacy 3-file splitting for mate-pairs:
First biological reads satisfying dumping
conditions are placed in files *_1.fastq and
*_2.fastq If only one biological read is
present it is placed in *.fastq Biological
reads and above are ignored.
-G|--spot-group Split into files by SPOT_GROUP (member name)
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-T|--group-in-dirs Split into subdirectories instead of files
-K|--keep-empty-files Do not delete empty files
FORMATTING
Sequence
-C|--dumpcs <[cskey]> Formats sequence using color space (default
for SOLiD),"cskey" may be specified for
translation
-B|--dumpbase Formats sequence using base space (default
for other than SOLiD).
Quality
-Q|--offset <integer> Offset to use for quality conversion,
default is 33
--fasta <[line width]> FASTA only, no qualities, optional line
wrap width (set to zero for no wrapping)
--suppress-qual-for-cskey suppress quality-value for cskey
Defline
-F|--origfmt Defline contains only original sequence name
-I|--readids Append read id after spot id as
'accession.spot.readid' on defline
--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence.
--defline-qual <fmt> Defline format specification for quality.
<fmt> is string of characters and/or
variables. The variables can be one of: $ac
- accession, $si spot id, $sn spot
name, $sg spot group (barcode), $sl spot
length in bases, $ri read number, $rn
read name, $rl read length in bases. '[]'
could be used for an optional output: if
all vars in [] yield empty values whole
group is not printed. Empty value is empty
string or for numeric variables. Ex:
@$sn[_$rn]/$ri '_$rn' is omitted if name
is empty
OTHER:
--disable-multithreading disable multithreading
-h|--help Output brief explanation of program usage
-V|--version Display the version of the program
-L|--log-level <level> Logging level as number or enum string One
of (fatal|sys|int|err|warn|info) or (0-5)
Current/default is warn
-v|--verbose Increase the verbosity level of the program
Use multiple times for more verbosity
--ncbi_error_report Control program execution environment
report generation (if implemented). One of
(never|error|always). Default is error
--legacy-report use legacy style 'Written spots' for tool
fastq-dump : 2.9.6