November 4, 2019 / Last updated: December 13, 2019 rbiology Bioinformatics

Retrieving SRA Data with SRA Toolkit

To retrieve SRA data from NCBI, use SRA Toolkit. Installation instructions for SRA Toolkit are here.

Checking the Accession ID

First, search the NCBI SRA database for the SRA file you want and note its Accession ID. To retrieve several datasets, download the Accession List beforehand.
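Before downloading, it can help to check that the Accession List holds one run accession per line. The following is a minimal sketch, assuming run accessions have the form SRR, ERR, or DRR followed by digits; `check_acc_list` is a hypothetical helper and not part of SRA Toolkit.

```shell
# Hypothetical helper (not an SRA Toolkit command): print every line of an
# accession list that does not look like a run accession and return 1 if any
# such line exists. Blank lines are ignored.
check_acc_list() {
    list_file="$1"
    # Drop Windows line endings, then keep only lines that are neither empty
    # nor of the assumed form SRR/ERR/DRR + digits.
    bad=$(tr -d '\r' < "$list_file" | grep -Ev '^([SED]RR[0-9]+)?$' || true)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}
```

Calling `check_acc_list SRR_Acc_List.txt` prints nothing and succeeds when every non-empty line matches the assumed pattern.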

Downloading SRA files

# Download the sra file for DRR084187 into the current directory
prefetch --output-directory ./ DRR084187
# Download every run listed in an Accession List
prefetch --output-directory ./ --option-file SRR_Acc_List.txt
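As an alternative to passing the whole list at once, the accessions can be fetched one at a time so that a single failure is reported without hiding the others. This is only a sketch: `fetch_from_list` and the `PREFETCH` variable are hypothetical names introduced here, and `PREFETCH` can be set to `echo prefetch` for a dry run.

```shell
# Hypothetical wrapper: run prefetch once per accession in a list file.
# PREFETCH is an assumption of this sketch; it defaults to the real command.
fetch_from_list() {
    list_file="$1"
    out_dir="${2:-./}"
    # "|| [ -n "$acc" ]" also handles a last line without a trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc=$(printf '%s' "$acc" | tr -d '\r')   # tolerate Windows line endings
        [ -n "$acc" ] || continue                # skip blank lines
        # Redirect stdin so the command cannot consume the rest of the list.
        ${PREFETCH:-prefetch} --output-directory "$out_dir" "$acc" < /dev/null \
            || echo "prefetch failed for $acc" >&2
    done < "$list_file"
}
```

For example, `PREFETCH="echo prefetch"` followed by `fetch_from_list SRR_Acc_List.txt ./` prints the commands that would be run without downloading anything.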

Converting to fastq files

# Convert a single-end sra file to a gzip-compressed fastq file
fastq-dump --gzip --defline-seq '@$sn[_$rn]/$ri' DRR084187.sra
# Convert a paired-end sra file to bzip2-compressed fastq files
fastq-dump --bzip2 --split-files --defline-seq '@$sn[_$rn]/$ri' DRR028826.sra
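To check that a conversion produced what was expected, the output names can be derived from the input name. The sketch below assumes the usual naming (`<run>.fastq.<ext>` without splitting, `<run>_1.fastq.<ext>` and `<run>_2.fastq.<ext>` with `--split-files` on paired-end data); `expected_fastq_names` is a hypothetical helper, and the actual names should be confirmed against your own output.

```shell
# Hypothetical helper: print the file names the conversions are expected to
# write, one per line.
#   $1: sra file name, $2: "single" or "paired", $3: suffix ("gz" or "bz2")
expected_fastq_names() {
    sra_file="$1"
    layout="$2"
    ext="$3"
    base=$(basename "$sra_file" .sra)
    if [ "$layout" = "paired" ]; then
        printf '%s_1.fastq.%s\n%s_2.fastq.%s\n' "$base" "$ext" "$base" "$ext"
    else
        printf '%s.fastq.%s\n' "$base" "$ext"
    fi
}
```

For instance, `expected_fastq_names DRR028826.sra paired bz2` lists the two files to look for after the paired-end conversion.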

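Here, --defline-seq '@$sn[_$rn]/$ri' sets the read IDs in the fastq file to the Illumina style, in which each read name ends in the read number (some software, such as Trinity, requires Illumina-style IDs). In addition, by default the third line of each fastq record repeats the ID after the "+", but in most cases a bare "+" is sufficient, so specifying --defline-qual '+' reduces the file size.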
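The effect of --defline-qual '+' can be illustrated on a fastq file that was already written with the ID repeated on the third line. The following sketch does the equivalent trimming after the fact with awk; `strip_plus_line` is a hypothetical helper, and it assumes strict four-line records with no wrapped sequence lines.

```shell
# Hypothetical filter: replace the third line of every 4-line fastq record
# with a bare "+", leaving the ID, sequence, and quality lines untouched.
# Reads uncompressed fastq on stdin and writes to stdout.
strip_plus_line() {
    awk 'NR % 4 == 3 { print "+"; next } { print }'
}
```

It could be used as, for example, `gzip -dc in.fastq.gz | strip_plus_line | gzip > out.fastq.gz`, where both file names are placeholders.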

To convert several files at once, loop over them with a for statement as follows.

# Convert every sra file in the current directory to fastq
for file in ./*.sra
do
    fastq-dump --bzip2 --split-files --defline-qual '+' --defline-seq '@$sn[_$rn]/$ri' "$file"
done

Usage of each command

The usage of the prefetch command is as follows.

$ prefetch --help
Usage:
  prefetch [options] <SRA accession | kart file> [...]
  Download SRA or dbGaP files and their dependencies
  prefetch [options] <SRA file> [...]
  Check SRA file for missed dependencies and download them
  prefetch --list <kart file> [...]
  List content of kart file
Options:
  -T|--type                        Specify file type to download. Default: sra
  -t|--transport <value>           Transport: one of: fasp; http; both. (fasp
                                   only; http only; first try fasp (ascp), use
                                   http if cannot download using fasp).
                                   Default: both
  -N|--min-size <size>             Minimum file size to download in KB
                                   (inclusive).
  -X|--max-size <size>             Maximum file size to download in KB
                                   (exclusive). Default: 20G
  -f|--force <value>               Force object download one of: no, yes,
                                   all. no [default]: skip download if the
                                   object if found and complete; yes: download
                                   it even if it is found and is complete; all:
                                   ignore lock files (stale locks or it is
                                   being downloaded by another process: use at
                                   your own risk!)
  -p|--progress <value>            Time period in minutes to display download
                                   progress (0: no progress), default: 1
  --eliminate-quals                Don't download QUALITY column
  -c|--check-all                   Double-check all refseqs
  -l|--list                        List the content of kart file
  -n|--numbered-list               List the content of kart file with kart
                                   row numbers
  -s|--list-sizes                  List the content of kart file with target
                                   file sizes
  -R|--rows <rows>                 Kart rows to download (default all). row
                                   list should be ordered
  -o|--order <value>               Kart prefetch order when downloading
                                   kart: one of: kart, size. (in kart order, by
                                   file size: smallest first), default: size
  -a|--ascp-path <ascp-binary|private-key-file>  Path to ascp program and
                                   private key file (asperaweb_id_dsa.putty)
  --ascp-options <value>           Arbitrary options to pass to ascp command
                                   line
  -o|--output-file <FILE>          Write file to FILE when downloading
                                   single file
  -O|--output-directory <DIRECTORY>  Save files to DIRECTORY/
  -h|--help                        Output brief explanation for the program.
  -V|--version                     Display the version of the program then
                                   quit.
  -L|--log-level <level>           Logging level as number or enum string. One
                                   of (fatal|sys|int|err|warn|info|debug) or
                                   (0-6) Current/default is warn
  -v|--verbose                     Increase the verbosity of the program
                                   status messages. Use multiple times for more
                                   verbosity. Negates quiet.
  -q|--quiet                       Turn off all status messages for the
                                   program. Negated by verbose.
  --option-file <file>             Read more options and parameters from the
                                   file.
prefetch : 2.9.6

The usage of fastq-dump is as follows.

$ fastq-dump --help
Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>
INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in
                                   filename(s) and deflines (only for single
                                   table dump)
  --table <table-name>             Table name within cSRA object, default is
                                   "SEQUENCE"
PROCESSING
Read Splitting                     Sequence data may be used in raw form or
                                     split into individual reads
  --split-spot                     Split spots into individual reads
Full Spot Filters                  Applied to the full spot independently
                                     of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id
  -X|--maxSpotId <rowid>           Maximum spot id
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...]
  -W|--clip                        Remove adapter sequences from reads
Common Filters                     Applied to spots when --split-spot is not
                                     set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len>
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value
                                   optionally filter by value:
                                   pass|reject|criteria|redacted
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no
                                   sequences starting or ending with >= 10N
  --qual-filter-1                  Filter used in current 1000 Genomes data
Filters based on alignments        Filters are active when alignment
                                     data are present
  --aligned                        Dump only aligned sequences
  --unaligned                      Dump only unaligned sequences
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can
                                   either be accession.version (ex:
                                   NC_000001.10) or file specific name (ex:
                                   "chr1" or "1"). "from" and "to" are 1-based
                                   coordinates
  --matepair-distance <from-to|unknown>  Filter by distance between matepairs.
                                   Use "unknown" to find matepairs split
                                   between the references. Use from-to to limit
                                   matepair distance on the same reference
Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads
OUTPUT
  -O|--outdir <path>               Output directory, default is working
                                   directory '.' )
  -Z|--stdout                      Output to stdout, all split data become
                                   joined into single stream
  --gzip                           Compress output using gzip
  --bzip2                          Compress output using bzip2
Multiple File Options              Setting these options will produce more
                                     than 1 file, each of which will be suffixed
                                     according to splitting criteria.
  --split-files                    Dump each read into separate file.Files
                                   will receive suffix corresponding to read
                                   number
  --split-3                        Legacy 3-file splitting for mate-pairs:
                                   First biological reads satisfying dumping
                                   conditions are placed in files *_1.fastq and
                                   *_2.fastq If only one biological read is
                                   present it is placed in *.fastq Biological
                                   reads 3 and above are ignored.
  -G|--spot-group                  Split into files by SPOT_GROUP (member name)
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value
                                   optionally filter by value:
                                   pass|reject|criteria|redacted
  -T|--group-in-dirs               Split into subdirectories instead of files
  -K|--keep-empty-files            Do not delete empty files
FORMATTING
Sequence
  -C|--dumpcs <[cskey]>            Formats sequence using color space (default
                                   for SOLiD),"cskey" may be specified for
                                   translation
  -B|--dumpbase                    Formats sequence using base space (default
                                   for other than SOLiD).
Quality
  -Q|--offset <integer>            Offset to use for quality conversion,
                                   default is 33
  --fasta <[line width]>           FASTA only, no qualities, optional line
                                   wrap width (set to zero for no wrapping)
  --suppress-qual-for-cskey        suppress quality-value for cskey
Defline
  -F|--origfmt                     Defline contains only original sequence name
  -I|--readids                     Append read id after spot id as
                                   'accession.spot.readid' on defline
  --helicos                        Helicos style defline
  --defline-seq <fmt>              Defline format specification for sequence.
  --defline-qual <fmt>             Defline format specification for quality.
                                   <fmt> is string of characters and/or
                                   variables. The variables can be one of: $ac
                                   - accession, $si spot id, $sn spot
                                   name, $sg spot group (barcode), $sl spot
                                   length in bases, $ri read number, $rn
                                   read name, $rl read length in bases. '[]'
                                   could be used for an optional output: if
                                   all vars in [] yield empty values whole
                                   group is not printed. Empty value is empty
                                   string or 0 for numeric variables. Ex:
                                   @$sn[_$rn]/$ri '_$rn' is omitted if name
                                   is empty
OTHER:
  --disable-multithreading         disable multithreading
  -h|--help                        Output brief explanation of program usage
  -V|--version                     Display the version of the program
  -L|--log-level <level>           Logging level as number or enum string One
                                   of (fatal|sys|int|err|warn|info) or (0-5)
                                   Current/default is warn
  -v|--verbose                     Increase the verbosity level of the program
                                   Use multiple times for more verbosity
  --ncbi_error_report              Control program execution environment
                                   report generation (if implemented). One of
                                   (never|error|always). Default is error
  --legacy-report                  use legacy style 'Written spots' for tool
fastq-dump : 2.9.6