BigQuery is Google’s managed SQL-like database system that is designed to run at PB scales. It allows storing, querying, and manipulating absolutely huge datasets with little more effort (perhaps even less) than operating on a standard relational database. Cost is a pay-as-you-go model, based on the amount of data processed.
The Institute for Systems Biology (ISB) Cancer Genomics Cloud have loaded all of the TCGA level 3 data into BigQuery, enabling very facile slicing of data to a very fine granularity. Bulk data in file form is still likely to be the most appropriate way to do large-scale data mining and inference, but for gene- or sample-based queries or to join datasets in novel ways, BigQuery offers a possible solution.
library(bigrquery)
In BigQuery, a dataset
is something like a database
or a schema
from mysql or postgresql. Listing them is straightforward using bigrquery directly.
bigrquery::list_datasets('isb-cgc')
## [1] "ccle_201602_alpha" "genome_reference" "platform_reference"
## [4] "tcga_201510_alpha" "tcga_201607_beta" "tcga_cohorts"
## [7] "tcga_seq_metadata"
The first time you run this, you will be likely asked to login to a google billing account. This process will involve R kicking you over to a web browser (the first time) where you will need to login to your google cloud project. You will likely see this type of message….
Use a local file to cache OAuth access credentials between R sessions?
1: Yes
2: No
Selection: Yes
Adding ^\.httr-oauth$ to .Rbuildignore
Waiting for authentication in browser...
Press Esc/Ctrl + C to abort
Authentication complete.
Assuming that you have set up an account with the ISB Cancer Genomics Cloud Pilot, you can authorize R to access the account, and from then on will be able to operate without browser interaction.
bigrquery::list_tables("isb-cgc", "tcga_201607_beta")
## [1] "Annotations" "Biospecimen_data"
## [3] "Clinical_data" "Copy_Number_segments"
## [5] "DNA_Methylation_betas" "DNA_Methylation_chr1"
## [7] "DNA_Methylation_chr10" "DNA_Methylation_chr11"
## [9] "DNA_Methylation_chr12" "DNA_Methylation_chr13"
## [11] "DNA_Methylation_chr14" "DNA_Methylation_chr15"
## [13] "DNA_Methylation_chr16" "DNA_Methylation_chr17"
## [15] "DNA_Methylation_chr18" "DNA_Methylation_chr19"
## [17] "DNA_Methylation_chr2" "DNA_Methylation_chr20"
## [19] "DNA_Methylation_chr21" "DNA_Methylation_chr22"
## [21] "DNA_Methylation_chr3" "DNA_Methylation_chr4"
## [23] "DNA_Methylation_chr5" "DNA_Methylation_chr6"
## [25] "DNA_Methylation_chr7" "DNA_Methylation_chr8"
## [27] "DNA_Methylation_chr9" "DNA_Methylation_chrX"
## [29] "DNA_Methylation_chrY" "Protein_RPPA_data"
## [31] "Somatic_Mutation_calls" "mRNA_BCGSC_GA_RPKM"
## [33] "mRNA_BCGSC_HiSeq_RPKM" "mRNA_BCGSC_RPKM"
## [35] "mRNA_UNC_GA_RSEM" "mRNA_UNC_HiSeq_RSEM"
## [37] "mRNA_UNC_RSEM" "miRNA_BCGSC_GA_isoform"
## [39] "miRNA_BCGSC_GA_mirna" "miRNA_BCGSC_HiSeq_isoform"
## [41] "miRNA_BCGSC_HiSeq_mirna" "miRNA_BCGSC_isoform"
## [43] "miRNA_BCGSC_mirna" "miRNA_Expression"
The dplyr database tutorial discusses how to use dplyr
with Google BigQuery. We are going to use that approach here to simplify the interaction with the database.
library(dplyr)
Next, we set up a connection between dplyr
and bigquery
.
my_billing = "isb-cgc-01-0006" # replace billing info here with your own
bq = src_bigquery("isb-cgc", "tcga_201607_beta", billing = my_billing)
The available tables in the tcga_201510_alpha
database are given by simply showing the bq
object.
bq
## src: bigquery [isb-cgc:tcga_201607_beta]
## tbls: Annotations, Biospecimen_data, Clinical_data, Copy_Number_segments,
## DNA_Methylation_betas, DNA_Methylation_chr1, DNA_Methylation_chr10,
## DNA_Methylation_chr11, DNA_Methylation_chr12, DNA_Methylation_chr13,
## DNA_Methylation_chr14, DNA_Methylation_chr15, DNA_Methylation_chr16,
## DNA_Methylation_chr17, DNA_Methylation_chr18, DNA_Methylation_chr19,
## DNA_Methylation_chr2, DNA_Methylation_chr20, DNA_Methylation_chr21,
## DNA_Methylation_chr22, DNA_Methylation_chr3, DNA_Methylation_chr4,
## DNA_Methylation_chr5, DNA_Methylation_chr6, DNA_Methylation_chr7,
## DNA_Methylation_chr8, DNA_Methylation_chr9, DNA_Methylation_chrX,
## DNA_Methylation_chrY, miRNA_BCGSC_GA_isoform, miRNA_BCGSC_GA_mirna,
## miRNA_BCGSC_HiSeq_isoform, miRNA_BCGSC_HiSeq_mirna, miRNA_BCGSC_isoform,
## miRNA_BCGSC_mirna, miRNA_Expression, mRNA_BCGSC_GA_RPKM,
## mRNA_BCGSC_HiSeq_RPKM, mRNA_BCGSC_RPKM, mRNA_UNC_GA_RSEM,
## mRNA_UNC_HiSeq_RSEM, mRNA_UNC_RSEM, Protein_RPPA_data,
## Somatic_Mutation_calls
To work with a table from BigQuery using dplyr
, we need to set up a dplyr
table connection, one for each table that we wish to use. Here, I do that with the Annotations
table.
tbl_annot = tbl(bq,'Annotations')
# and show the columns
colnames(tbl_annot)
## [1] "annotationId" "annotationCategoryId"
## [3] "annotationCategoryName" "annotationClassification"
## [5] "annotationNoteText" "Study"
## [7] "itemTypeName" "itemBarcode"
## [9] "AliquotBarcode" "ParticipantBarcode"
## [11] "SampleBarcode" "dateAdded"
## [13] "dateCreated" "dateEdited"
Our dplyr
table object behaves essentially like a data.frame when using the dplyr
style of coding.
head(tbl_annot)
## Source: query [?? x 14]
## Database: bigquery [isb-cgc:tcga_201607_beta]
##
## annotationId annotationCategoryId annotationCategoryName
## <int> <int> <chr>
## 1 711 1 Tumor tissue origin incorrect
## 2 713 1 Tumor tissue origin incorrect
## 3 712 1 Tumor tissue origin incorrect
## 4 2908 1 Tumor tissue origin incorrect
## 5 19730 1 Tumor tissue origin incorrect
## 6 15108 2 Tumor type incorrect
## .. ... ... ...
## Variables not shown: annotationClassification <chr>, annotationNoteText
## <chr>, Study <chr>, itemTypeName <chr>, itemBarcode <chr>,
## AliquotBarcode <chr>, ParticipantBarcode <chr>, SampleBarcode <chr>,
## dateAdded <chr>, dateCreated <chr>, dateEdited <chr>.
tbl_mut = tbl(bq,'Somatic_Mutation_calls')
mut_counts = tbl_mut %>%
select(Study, Variant_Classification) %>%
group_by(Study,Variant_Classification) %>%
summarize(n = n()) %>%
arrange(Study,Variant_Classification) %>%
collect()
mut_counts %>% tidyr::spread(Variant_Classification,n)
## Source: local data frame [33 x 24]
## Groups: Study [33]
##
## Study 3'UTR 5'Flank 5'UTR De_novo_Start_InFrame
## <chr> <int> <int> <int> <int>
## 1 ACC 186 2 171 18
## 2 BLCA 1243 3 1536 114
## 3 BRCA 2888 3 2011 82
## 4 CESC 2246 3 1916 57
## 5 CHOL 76 NA 78 4
## 6 COAD 2566 9 1988 165
## 7 DLBC 163 NA 148 6
## 8 ESCA 2211 2 1206 40
## 9 GBM 608 1 484 54
## 10 HNSC 1057 3 1203 63
## .. ... ... ... ... ...
## Variables not shown: De_novo_Start_OutOfFrame <int>, Frame_Shift_Del
## <int>, Frame_Shift_Ins <int>, IGR <int>, In_Frame_Del <int>,
## In_Frame_Ins <int>, Intron <int>, lincRNA <int>, Missense_Mutation
## <int>, Nonsense_Mutation <int>, Nonstop_Mutation <int>, RNA <int>,
## Silent <int>, Splice_Site <int>, Start_Codon_Del <int>, Start_Codon_Ins
## <int>, Start_Codon_SNP <int>, Stop_Codon_Del <int>, Stop_Codon_Ins <int>.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] yaml_2.1.13 knitr_1.13
## [3] BiocStyle_2.0.2 dplyr_0.5.0
## [5] VariantAnnotation_1.18.1 Rsamtools_1.25.0
## [7] Biostrings_2.41.4 XVector_0.13.2
## [9] SummarizedExperiment_1.3.6 Biobase_2.33.0
## [11] GenomicRanges_1.25.9 GenomeInfoDb_1.8.1
## [13] IRanges_2.7.11 S4Vectors_0.11.7
## [15] BiocGenerics_0.19.1 bigrquery_0.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 highr_0.6
## [3] formatR_1.4 liftr_0.3
## [5] GenomicFeatures_1.24.3 objectProperties_0.6.5
## [7] bitops_1.0-6 tools_3.3.1
## [9] sevenbridges_1.2.4 zlibbioc_1.19.0
## [11] biomaRt_2.29.2 digest_0.6.9
## [13] docopt_0.4.5 uuid_0.1-2
## [15] BSgenome_1.41.2 RSQLite_1.0.0
## [17] jsonlite_1.0 evaluate_0.9
## [19] tibble_1.0 DBI_0.4-1
## [21] rstudioapi_0.6 curl_0.9.7
## [23] rtracklayer_1.33.8 httr_1.2.1
## [25] stringr_1.0.0 R6_2.1.2
## [27] AnnotationDbi_1.35.3 XML_3.98-1.4
## [29] BiocParallel_1.7.4 rmarkdown_0.9.6.9
## [31] tidyr_0.5.1 magrittr_1.5
## [33] GenomicAlignments_1.9.4 objectSignals_0.10.2
## [35] htmltools_0.3.5 rsconnect_0.4.3
## [37] assertthat_0.1 stringi_1.1.1
## [39] lazyeval_0.2.0 openssl_0.9.4
## [41] RCurl_1.95-4.8