What is BigQuery?

BigQuery is Google’s managed SQL-like database system that is designed to run at PB scales. It allows storing, querying, and manipulating absolutely huge datasets with little more effort (perhaps even less) than operating on a standard relational database. Cost is a pay-as-you-go model, based on the amount of data processed.

What is in BigQuery?

The Institute for Systems Biology (ISB) Cancer Genomics Cloud have loaded all of the TCGA level 3 data into BigQuery, enabling very facile slicing of data to a very fine granularity. Bulk data in file form is still likely to be the most appropriate way to do large-scale data mining and inference, but for gene- or sample-based queries or to join datasets in novel ways, BigQuery offers a possible solution.

Setup

library(bigrquery)

In BigQuery, a dataset is something like a database or a schema from mysql or postgresql. Listing them is straightforward using bigrquery directly.

bigrquery::list_datasets('isb-cgc')

## [1] "ccle_201602_alpha"  "genome_reference"   "platform_reference"
## [4] "tcga_201510_alpha"  "tcga_201607_beta"   "tcga_cohorts"      
## [7] "tcga_seq_metadata"

The first time you run this, you will be likely asked to login to a google billing account. This process will involve R kicking you over to a web browser (the first time) where you will need to login to your google cloud project. You will likely see this type of message….

Use a local file to cache OAuth access credentials between R sessions?
1: Yes
2: No

Selection: Yes
Adding ^\.httr-oauth$ to .Rbuildignore
Waiting for authentication in browser...
Press Esc/Ctrl + C to abort
Authentication complete.

Assuming that you have set up an account with the ISB Cancer Genomics Cloud Pilot, you can authorize R to access the account, and from then on will be able to operate without browser interaction.

bigrquery::list_tables("isb-cgc", "tcga_201607_beta")

##  [1] "Annotations"               "Biospecimen_data"         
##  [3] "Clinical_data"             "Copy_Number_segments"     
##  [5] "DNA_Methylation_betas"     "DNA_Methylation_chr1"     
##  [7] "DNA_Methylation_chr10"     "DNA_Methylation_chr11"    
##  [9] "DNA_Methylation_chr12"     "DNA_Methylation_chr13"    
## [11] "DNA_Methylation_chr14"     "DNA_Methylation_chr15"    
## [13] "DNA_Methylation_chr16"     "DNA_Methylation_chr17"    
## [15] "DNA_Methylation_chr18"     "DNA_Methylation_chr19"    
## [17] "DNA_Methylation_chr2"      "DNA_Methylation_chr20"    
## [19] "DNA_Methylation_chr21"     "DNA_Methylation_chr22"    
## [21] "DNA_Methylation_chr3"      "DNA_Methylation_chr4"     
## [23] "DNA_Methylation_chr5"      "DNA_Methylation_chr6"     
## [25] "DNA_Methylation_chr7"      "DNA_Methylation_chr8"     
## [27] "DNA_Methylation_chr9"      "DNA_Methylation_chrX"     
## [29] "DNA_Methylation_chrY"      "Protein_RPPA_data"        
## [31] "Somatic_Mutation_calls"    "mRNA_BCGSC_GA_RPKM"       
## [33] "mRNA_BCGSC_HiSeq_RPKM"     "mRNA_BCGSC_RPKM"          
## [35] "mRNA_UNC_GA_RSEM"          "mRNA_UNC_HiSeq_RSEM"      
## [37] "mRNA_UNC_RSEM"             "miRNA_BCGSC_GA_isoform"   
## [39] "miRNA_BCGSC_GA_mirna"      "miRNA_BCGSC_HiSeq_isoform"
## [41] "miRNA_BCGSC_HiSeq_mirna"   "miRNA_BCGSC_isoform"      
## [43] "miRNA_BCGSC_mirna"         "miRNA_Expression"

The dplyr package for bigquery

The dplyr database tutorial discusses how to use dplyr with Google BigQuery. We are going to use that approach here to simplify the interaction with the database.

library(dplyr)

Next, we set up a connection between dplyr and bigquery.

my_billing = "isb-cgc-01-0006" # replace billing info here with your own
bq = src_bigquery("isb-cgc", "tcga_201607_beta", billing = my_billing)

The available tables in the tcga_201510_alpha database are given by simply showing the bq object.

bq

## src:  bigquery [isb-cgc:tcga_201607_beta]
## tbls: Annotations, Biospecimen_data, Clinical_data, Copy_Number_segments,
##   DNA_Methylation_betas, DNA_Methylation_chr1, DNA_Methylation_chr10,
##   DNA_Methylation_chr11, DNA_Methylation_chr12, DNA_Methylation_chr13,
##   DNA_Methylation_chr14, DNA_Methylation_chr15, DNA_Methylation_chr16,
##   DNA_Methylation_chr17, DNA_Methylation_chr18, DNA_Methylation_chr19,
##   DNA_Methylation_chr2, DNA_Methylation_chr20, DNA_Methylation_chr21,
##   DNA_Methylation_chr22, DNA_Methylation_chr3, DNA_Methylation_chr4,
##   DNA_Methylation_chr5, DNA_Methylation_chr6, DNA_Methylation_chr7,
##   DNA_Methylation_chr8, DNA_Methylation_chr9, DNA_Methylation_chrX,
##   DNA_Methylation_chrY, miRNA_BCGSC_GA_isoform, miRNA_BCGSC_GA_mirna,
##   miRNA_BCGSC_HiSeq_isoform, miRNA_BCGSC_HiSeq_mirna, miRNA_BCGSC_isoform,
##   miRNA_BCGSC_mirna, miRNA_Expression, mRNA_BCGSC_GA_RPKM,
##   mRNA_BCGSC_HiSeq_RPKM, mRNA_BCGSC_RPKM, mRNA_UNC_GA_RSEM,
##   mRNA_UNC_HiSeq_RSEM, mRNA_UNC_RSEM, Protein_RPPA_data,
##   Somatic_Mutation_calls

To work with a table from BigQuery using dplyr, we need to set up a dplyr table connection, one for each table that we wish to use. Here, I do that with the Annotations table.

tbl_annot = tbl(bq,'Annotations')
# and show the columns
colnames(tbl_annot)

##  [1] "annotationId"             "annotationCategoryId"    
##  [3] "annotationCategoryName"   "annotationClassification"
##  [5] "annotationNoteText"       "Study"                   
##  [7] "itemTypeName"             "itemBarcode"             
##  [9] "AliquotBarcode"           "ParticipantBarcode"      
## [11] "SampleBarcode"            "dateAdded"               
## [13] "dateCreated"              "dateEdited"

Our dplyr table object behaves essentially like a data.frame when using the dplyr style of coding.

head(tbl_annot)

## Source:   query [?? x 14]
## Database: bigquery [isb-cgc:tcga_201607_beta]
## 
##    annotationId annotationCategoryId        annotationCategoryName
##           <int>                <int>                         <chr>
## 1           711                    1 Tumor tissue origin incorrect
## 2           713                    1 Tumor tissue origin incorrect
## 3           712                    1 Tumor tissue origin incorrect
## 4          2908                    1 Tumor tissue origin incorrect
## 5         19730                    1 Tumor tissue origin incorrect
## 6         15108                    2          Tumor type incorrect
## ..          ...                  ...                           ...
## Variables not shown: annotationClassification <chr>, annotationNoteText
##   <chr>, Study <chr>, itemTypeName <chr>, itemBarcode <chr>,
##   AliquotBarcode <chr>, ParticipantBarcode <chr>, SampleBarcode <chr>,
##   dateAdded <chr>, dateCreated <chr>, dateEdited <chr>.

Example use cases

tbl_mut = tbl(bq,'Somatic_Mutation_calls')

How many variants of each type are present for each disease type?

mut_counts = tbl_mut %>% 
  select(Study, Variant_Classification) %>% 
  group_by(Study,Variant_Classification) %>% 
  summarize(n = n()) %>% 
  arrange(Study,Variant_Classification) %>%
  collect()

mut_counts %>% tidyr::spread(Variant_Classification,n)

## Source: local data frame [33 x 24]
## Groups: Study [33]
## 
##    Study 3'UTR 5'Flank 5'UTR De_novo_Start_InFrame
##    <chr> <int>   <int> <int>                 <int>
## 1    ACC   186       2   171                    18
## 2   BLCA  1243       3  1536                   114
## 3   BRCA  2888       3  2011                    82
## 4   CESC  2246       3  1916                    57
## 5   CHOL    76      NA    78                     4
## 6   COAD  2566       9  1988                   165
## 7   DLBC   163      NA   148                     6
## 8   ESCA  2211       2  1206                    40
## 9    GBM   608       1   484                    54
## 10  HNSC  1057       3  1203                    63
## ..   ...   ...     ...   ...                   ...
## Variables not shown: De_novo_Start_OutOfFrame <int>, Frame_Shift_Del
##   <int>, Frame_Shift_Ins <int>, IGR <int>, In_Frame_Del <int>,
##   In_Frame_Ins <int>, Intron <int>, lincRNA <int>, Missense_Mutation
##   <int>, Nonsense_Mutation <int>, Nonstop_Mutation <int>, RNA <int>,
##   Silent <int>, Splice_Site <int>, Start_Codon_Del <int>, Start_Codon_Ins
##   <int>, Start_Codon_SNP <int>, Stop_Codon_Del <int>, Stop_Codon_Ins <int>.

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] yaml_2.1.13                knitr_1.13                
##  [3] BiocStyle_2.0.2            dplyr_0.5.0               
##  [5] VariantAnnotation_1.18.1   Rsamtools_1.25.0          
##  [7] Biostrings_2.41.4          XVector_0.13.2            
##  [9] SummarizedExperiment_1.3.6 Biobase_2.33.0            
## [11] GenomicRanges_1.25.9       GenomeInfoDb_1.8.1        
## [13] IRanges_2.7.11             S4Vectors_0.11.7          
## [15] BiocGenerics_0.19.1        bigrquery_0.3.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5             highr_0.6              
##  [3] formatR_1.4             liftr_0.3              
##  [5] GenomicFeatures_1.24.3  objectProperties_0.6.5 
##  [7] bitops_1.0-6            tools_3.3.1            
##  [9] sevenbridges_1.2.4      zlibbioc_1.19.0        
## [11] biomaRt_2.29.2          digest_0.6.9           
## [13] docopt_0.4.5            uuid_0.1-2             
## [15] BSgenome_1.41.2         RSQLite_1.0.0          
## [17] jsonlite_1.0            evaluate_0.9           
## [19] tibble_1.0              DBI_0.4-1              
## [21] rstudioapi_0.6          curl_0.9.7             
## [23] rtracklayer_1.33.8      httr_1.2.1             
## [25] stringr_1.0.0           R6_2.1.2               
## [27] AnnotationDbi_1.35.3    XML_3.98-1.4           
## [29] BiocParallel_1.7.4      rmarkdown_0.9.6.9      
## [31] tidyr_0.5.1             magrittr_1.5           
## [33] GenomicAlignments_1.9.4 objectSignals_0.10.2   
## [35] htmltools_0.3.5         rsconnect_0.4.3        
## [37] assertthat_0.1          stringi_1.1.1          
## [39] lazyeval_0.2.0          openssl_0.9.4          
## [41] RCurl_1.95-4.8

Querying TCGA data using BigQuery