APIs and Synthetic Biology

A description of the API concept in engineering and why it is useful, particularly for working with genomics data, followed by an analogy to the API concept in synthetic biology and how evolution enables encapsulation.

Published in: Technology
  • Jake asked for computational tools and for biology; try to give you some of both.

    “Industry talk” – devoid of original content

    Used to be like you (researcher).

    Immune repertoire sequencing with George Church.

    Learned data management technology at Cloudera.
    Share some insights.
  • Log scale.
  • Any assay that can be encoded in DNA is now high-throughput

    People working on this data aren’t always aware of the best tools out there.
  • Custom script; who knows how it’s managed.

    Doesn’t take advantage of possible optimizations in the data. Done manually.

    Processing custom file format that no one else can do. Custom parser.

    No support for automatically splitting the data and parallelizing.

    Have to run it on a machine with access to the file system
  • Declarative description of what I want.

    Abstracted away underlying store. Could be:
    Table in distributed file system like Hadoop
    Distributed In-memory data structure
    SQLite file on local disk
    MySQL
    Postgres
    Can be sent to remote cluster.
  • Multiple possible implementations of size()
    Multiple add methods

    Array vs. linked list
  • GOTO SITE
  • Love to see APIs for:
    Accessing Ab/TCR sequences
    Accessing germline sequences
    Accessing immune locus information
    Accessing primer sequences
    Intersecting primer sequences with sequence databases
    Accessing MHC sequences
    MHC nomenclature conversion
    Immunogenetics ontology definition and service
    Immune receptor alignment and numbering
    Immune receptor numbering conversion service
    Immune receptor phylogeny
    Immune receptor structure predictions
    Accessing epitope database
  • In principle, they’ve done the work to support this type of stuff.
  • But…
  • Some people have proposed solving this problem
  • Accessing V-QUEST

    Horrible documentation. Required a bit of reverse engineering.

    Yelled at me for doing so.
  • Ideally the community would define the common set of endpoints that a user might expect.
  • Separately, genomics has converged.
  • Have to parse FORMAT before you can parse the actual genotype calls
    FORMAT/INFO fields customizable

    VCF records are dynamically typed. Classification as a SNP, Indel, Mixed, etc. depends on the properties of the alleles in the record.

    Entries for particular CHROM must be in a single block. Position must be sorted. Makes it hard to add variants.

    The number of rows is bounded by the length of the genome, but records should scale with the data that actually grows, which is genotype calls. This makes it difficult to add new samples.

    Text format. Relatively poor compression. Verbose. Must be parsed. Slower.

    Often Gzip-compressed – non-splittable.
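The first point in these notes (the FORMAT column has to be parsed before any genotype call can be read) can be sketched in a few lines. This is a minimal illustration, not a full VCF parser: the function name `parse_vcf_line` and the example record are assumptions made up for the sketch, and the record assumes tab-separated columns.

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a dict (illustrative sketch only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info_dict = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        info_dict[key] = value if sep else True

    # The FORMAT column names the sub-fields of every sample column,
    # so it must be read before the genotype calls make sense.
    format_keys = fmt.split(":")
    genotypes = {
        name: dict(zip(format_keys, call.split(":")))
        for name, call in zip(sample_names, fields[9:])
    }

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": filt,
        "info": info_dict,
        "genotypes": genotypes,
    }


# Hypothetical record, made up for this example.
record = "\t".join([
    "20", "14370", ".", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ", "0|0:48:1:51,51",
])
parsed = parse_vcf_line(record, ["NA00001"])
print(parsed["genotypes"]["NA00001"])
```

Even in this small case the meaning of each sample column depends on another column in the same row, which is part of why the format is awkward to parse and to extend.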
  • VCF already better than the immune situation
  • One model of an aligned read
  • GOTO SITE
  • Complain that binary is harder to read/process, but Avro/Thrift make that easy.
  • Enumeration of failure modes
  • Generation of diversity for the internal implementation.

    Simple input and output signals.
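The idea in this note (vary the internal implementation freely, select only on simple input/output behavior) can be sketched as a toy mutate-and-select loop. Everything here is hypothetical: `target_spec` stands in for the required input/output behavior, the two parameters stand in for an internal implementation, and the loop is a plain elitist search rather than any method from the cited papers.

```python
import random


def target_spec(x):
    """The required input/output behavior (hypothetical)."""
    return 2.0 * x + 1.0


INPUTS = [0.0, 1.0, 2.0, 3.0]


def error(params):
    """Squared deviation from the spec; only inputs and outputs are inspected."""
    a, b = params
    return sum((a * x + b - target_spec(x)) ** 2 for x in INPUTS)


def evolve(generations=200, pool_size=20, seed=0):
    rng = random.Random(seed)
    best = (0.0, 0.0)
    history = [error(best)]
    for _ in range(generations):
        # Generate diversity in the internal implementation.
        variants = [
            (best[0] + rng.gauss(0, 0.1), best[1] + rng.gauss(0, 0.1))
            for _ in range(pool_size)
        ]
        # Select purely on input/output behavior; keep the parent if no variant is better.
        candidate = min(variants, key=error)
        if error(candidate) < error(best):
            best = candidate
        history.append(error(best))
    return best, history


best, history = evolve()
print(best, history[0], history[-1])
```

The selection step never looks at how a variant works internally, only at whether its outputs match the spec, which is the sense in which evolution encapsulates the implementation.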
  • The right kind of diversity matters.
  • Viruses depend on API compatibility in order to infect
  • Matrix of possible Ab-Ag interactions.

    But not currently possible to get both at the same time.
  • We chose to get only the antibody information, with little functional information.

    GENETIC approach
  • Alternatively, and cleverly, go for the other half. This way, the functionality is still useful.
  • Joined a project with Steve Elledge, led by Ben Larman to discover autoantigens.

    Tile all human ORFs with peptides.

    Synthesize peptides and clone into phage.
  • Carl June chimeric receptors?

    Checkpoint blockade?

    Steroids/immunosuppressants?
  • APIs and Synthetic Biology

    1. 1. 1 The API Uri Laserson | @laserson | laserson@cloudera.com 21 May 2014
    2. 2. 2 The API, or how to make your computational collaborators love you Uri Laserson | @laserson | laserson@cloudera.com 21 May 2014
    3. 3. 3 The API, or how to make your computational collaborators love you, and also some perspectives on engineering biology and immunology Uri Laserson | @laserson | laserson@cloudera.com 21 May 2014
    4. 4. 4
    5. 5. NCBI Sequence Read Archive (SRA) 5 Today… 1.14 petabytes One year ago… 609 terabytes
    6. 6. For every “-ome” there’s a “-seq” Genome DNA-seq Transcriptome RNA-seq FRT-seq NET-seq Methylome Bisulfite-seq Immunome Immune-seq Proteome PhIP-seq Bind-n-seq
    7. 7. Crappy academic code

        counts_dict = {}
        for chain in vdj.parse_VDJXML(inhandle):
            try:
                counts_dict[chain.junction] += 1
            except KeyError:
                counts_dict[chain.junction] = 1

        for count in counts_dict.itervalues():
            print >>outhandle, np.int_(count)

    8. 8. Crappy academic code

        counts_dict = {}
        for chain in vdj.parse_VDJXML(inhandle):
            try:
                counts_dict[chain.junction] += 1
            except KeyError:
                counts_dict[chain.junction] = 1

        for count in counts_dict.itervalues():
            print >>outhandle, np.int_(count)

        vs.

        SELECT count(*) FROM antibodies GROUP BY junction
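To make the comparison on slide 8 concrete, here is a minimal sketch that runs the same grouping both ways. It uses Python's built-in `sqlite3` module with an in-memory database; the `antibodies` table name follows the slide, while the junction strings are made up for the example and the query additionally selects the `junction` column so the counts can be labeled.

```python
import sqlite3
from collections import Counter

# Hypothetical junction (CDR3) strings standing in for parsed antibody records.
junctions = ["CARDYW", "CARDYW", "CAKGGW", "CARDYW", "CAKGGW", "CTRSSW"]

# Imperative version: the same result as the hand-rolled dictionary loop.
manual_counts = Counter(junctions)

# Declarative version: load the records into a table and let the engine group them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE antibodies (junction TEXT)")
conn.executemany(
    "INSERT INTO antibodies (junction) VALUES (?)",
    [(j,) for j in junctions],
)
rows = conn.execute(
    "SELECT junction, count(*) FROM antibodies GROUP BY junction"
).fetchall()
conn.close()

sql_counts = dict(rows)
print(sorted(sql_counts.items()))
```

The query states what is wanted and leaves the how to the engine, so the same statement could in principle be pointed at a different backing store without rewriting the analysis.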
    9. 9. What is an API? 9
    10. 10. What is an API? • Application Programming Interface • Contract (between machines) • Specifications for: 1. Procedures and methods 2. Data structures/messages 10
    11. 11. Stripe API 11
    12. 12. Stripe API 12
    13. 13. Java API

        public interface List<E> {
            int size();
            boolean isEmpty();
            boolean contains(Object o);
            boolean add(E e);
            void add(int index, E element);
            boolean remove(Object o);
        }
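The point of the `List` interface is that callers depend on the contract, not on how `size()` or `add()` is implemented. A rough Python analogue, assuming an abstract base class named `SimpleList` (a name invented for this sketch) with an array-backed and a linked-list implementation:

```python
from abc import ABC, abstractmethod


class SimpleList(ABC):
    """The contract: callers rely only on these two methods."""

    @abstractmethod
    def size(self):
        ...

    @abstractmethod
    def add(self, element):
        ...


class ArrayList(SimpleList):
    """Backed by a contiguous Python list."""

    def __init__(self):
        self._items = []

    def size(self):
        return len(self._items)

    def add(self, element):
        self._items.append(element)
        return True


class _Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedList(SimpleList):
    """Backed by singly linked nodes with a running count."""

    def __init__(self):
        self._head = None
        self._tail = None
        self._count = 0

    def size(self):
        return self._count

    def add(self, element):
        node = _Node(element)
        if self._tail is None:
            self._head = node
        else:
            self._tail.next = node
        self._tail = node
        self._count += 1
        return True


def load_genes(target):
    """Client code written against the interface only."""
    for gene in ["IGHV3-23", "IGHJ4", "IGHD5-24"]:
        target.add(gene)
    return target.size()


print(load_genes(ArrayList()), load_genes(LinkedList()))
```

`load_genes` never needs to know which implementation it was handed, which is the loose coupling the later slides argue for.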
    14. 14. Python DB API v2.0 (PEP 249) 14 http://legacy.python.org/dev/peps/pep-0249/
    15. 15. Why use an API? • Encapsulation/interfaces/abstraction • Loose-coupling of components • Reusable services • Service-oriented architecture 15
    16. 16. Linked-In’s Loose Coupling Architecture 16
    17. 17. Linked-In’s Loose Coupling Architecture 17
    18. 18. 18 (If This Then That) Stitching APIs together https://ifttt.com/recipes#popular
    19. 19. 19
    20. 20. 20 IMGT
    21. 21. IMGT “Spec” 21 http://www.imgt.org/IMGTScientificChart/
    22. 22. IMGT’s API is an FTP site 22
    23. 23. IMGT does not have an API

        def __initVQUESTform(self):
            # get form
            request = urllib2.Request(
                'http://imgt.cines.fr/IMGT_vquest/vquest?livret=0&Option=humanIg')
            response = urllib2.urlopen(request)
            forms = ClientForm.ParseResponse(
                response,
                form_parser_class=ClientForm.XHTMLCompatibleFormParser,
                backwards_compat=False)
            response.close()
            form = forms[0]

            # fill out base part of form - Synthesis view with no extra options - TEXT
            form['l01p01c03'] = ['inline']
            form['l01p01c07'] = ['2. Synthesis']
            form['l01p01c05'] = ['TEXT']  # may need to be 'TEXT'
            form['l01p01c09'] = ['60']
            form['l01p01c35'] = ['F+ORF+ in-frame P']
            form['l01p01c36'] = ['0']
            form['l01p01c40'] = ['1']  # ['1'] for searching with indels
            form['l01p01c25'] = ['default']
            ...
    24. 24. Haussler and genomics services 24
    25. 25. Google Genomics API 25
    26. 26. Google Genomics API 26
    27. 27. Flask/Bottle web server example

        # Assumption: the bare @route decorator is Bottle's; Flask would use @app.route.
        from bottle import route

        @route("/receptor/<id>")
        def lookup_receptor(id):
            # get the raw read
            pass

        @route("/sample/<sample_id>")
        def sample_summary(sample_id):
            # impl for getting sample information; can return:
            # * summary of repertoire information
            #   (num reads, VDJ distribution, etc.)
            # * demographic info
            pass

        @route("/sample/<sample_id>/common_junctions")
        def common_junctions(sample_id):
            # impl for getting the most common CDR3s
            pass
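As a rough idea of what the body of the `common_junctions` endpoint might compute, here is a framework-free sketch using only the standard library. The in-memory `SAMPLES` store, the record layout, the `limit` parameter, and the shape of the JSON response are all assumptions made for illustration; a real service would read from whatever store sits behind the API.

```python
import json
from collections import Counter

# Hypothetical in-memory store: sample id -> list of receptor records.
SAMPLES = {
    "donor1": [
        {"name": "read1", "junction": "CARDYW"},
        {"name": "read2", "junction": "CAKGGW"},
        {"name": "read3", "junction": "CARDYW"},
        {"name": "read4", "junction": "CTRSSW"},
        {"name": "read5", "junction": "CARDYW"},
        {"name": "read6", "junction": "CAKGGW"},
    ],
}


def common_junctions(sample_id, limit=2):
    """Return the most common junctions for a sample as a JSON string."""
    records = SAMPLES.get(sample_id)
    if records is None:
        return json.dumps({"error": "unknown sample"})
    counts = Counter(record["junction"] for record in records)
    top = [
        {"junction": junction, "count": count}
        for junction, count in counts.most_common(limit)
    ]
    return json.dumps({"sample_id": sample_id, "common_junctions": top})


print(common_junctions("donor1"))
```

Because the function takes an identifier and returns a serialized message, it could be wrapped by a route decorator in either framework without changing its logic.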
    28. 28. Genomics ETL has converged on standards 28 .fastq .bam .vcf short read alignment genotype calling analysisbiochemistry
    29. 29. VCF

        ##fileformat=VCFv4.1
        ##fileDate=20090805
        ##source=myImputationProgramV3.1
        ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
        ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,spe
        ##phasing=partial
        ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
        ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
        ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
        ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
        ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
        ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
        ##FILTER=<ID=q10,Description="Quality below 10">
        ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
        ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
        ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
        ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
        ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
        #CHR POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
        20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:
        20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:
        20 1110696 rs604 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.6 GT:GQ:DP:HQ 1|2:21:6:
        20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7:56,
    30. 30. What about immune data? 30 .fastq .bam .vcf short read alignment genotype calling analysisbiochemistry .???immune receptor alignment
    31. 31. Multiple models for same types: VDJFasta

        sub new {
            my ($class) = @_;
            my $self = {};
            $self->{filename} = "";
            $self->{headers} = [];
            $self->{sequence} = [];
            $self->{germline} = [];
            $self->{nseqs} = 0;
            $self->{mids} = {};
            $self->{accVsegQstart} = {};  # example: 124
            $self->{accVsegQend} = {};    # example: 417
            $self->{accJsegQstart} = {};
            $self->{accJsegQend} = {};
            $self->{accDsegQstart} = {};
    32. 32. Multiple models for same types: vdj

        class ImmuneChain(SeqRecord):
            def cdr3(self):
                return len(self.junction)

            def num_mutations(self):
                aln = self.letter_annotations['alignment']
                return aln.count('S') + aln.count('I')

            def v(self):
                return self.__getattribute__('V-REGION').qualifiers['allele'][0]

            def v_seq(self):
                return self.__getattribute__('V-REGION').extract(self.seq.tostring())
    33. 33. 33 Interoperability/services depend on being able to communicate data
    34. 34. CSV

        9 CCTG_PRCONS=IGHC1_R1_IGM unproductive Homsap IGHV5-51*01 F, or Homsap IGHV5-51*0
        12 GGGG_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-11*01 F Homsap IGHJ1*01 F
        13 CTTC_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV1-2*02 F Homsap IGHJ5*02 F
        18 ACTT_PRCONS=IGHC3_R1_IGA productive Homsap IGKV3-15*01 F, or Homsap IGKV3D-15*
        20 GGAC_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-61*02 F Homsap IGHJ4*02 F
        25 TCGT_PRCONS=IGHC2_R1_IGD productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*0
        26 GGTG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*0
        28 GTGA_PRCONS=IGHC5_R1_IGG productive Homsap IGHV1-46*01 F, or Homsap IGHV1-46*0
        31 ACCC_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02
        36 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02
        39 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-7*01 F Homsap IGHJ6*02 F
        40 GGGT_PRCONS=IGHC1_R1_IGM productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*0
        42 TAGG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-39*01 F, or Homsap IGHV4-39*0
        47 CAAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-15*01 F, or Homsap IGHV3-15*0
        48 AGAA_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV3-30*04 F, or Homsap IGHV3-30-3
        52 GCAG_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*0
        53 AACC_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-30*02 F Homsap IGHJ4*02 F
    35. 35. XML

        <ImmuneChain>
          <c>IGHD</c>
          <barcode>RL014</barcode>
          <j_start_idx>389</j_start_idx>
          <seq>TTGTGGCTATTTTAAA ... CTCGGACT</seq>
          <descr>003699_0091_0140</descr>
          <tag>coding</tag>
          <clone>IGHV3-43_IGHJ4|387</clone>
          <j>IGHJ4*02</j>
          <v_end_idx>314</v_end_idx>
          <v>IGHV3-43*01</v>
          <junction>TGTGCAAAAGATAATCT ... TCTTTGACTACTGG</junction>
          <d>IGHD5-24*01</d>
        </ImmuneChain>
    36. 36. JSON

        {
          "v": "IGHV4-39*02",
          "seq": "CCTATCCCCCTGTGTGCCTT ... CTCCACCAAG",
          "num_mutations": 43,
          "name": "HG2DXMN01CY8UH",
          "letter_annotations": {
            "alignment": "..............S....S....3333333333333333..
          },
          "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
          "j": "IGHJ6*03",
          "annotations": {
            "usearch_90_cluster": "6277",
            "experiment_date": "20120119",
            "donor": "17517",
            "sample_type": "memory_B_cells",
            "source": "SeqWright",
            "tags": ["revcomp", "coding"],
            "taxonomy": []
          },
          "d": "IGHD3-10*01",

        http://www.json.org/
    37. 37. JSON 37 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308 { "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308
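A minimal sketch of why JSON works well for lightweight exchange: records with nested fields survive a round trip through text using only the standard library. The gene names below are taken from the slides; the second record's `name` and `junction_nt` values, and the one-record-per-line layout, are assumptions for illustration.

```python
import json

chains = [
    {
        "name": "HG2DXMN01CY8UH",
        "v": "IGHV4-39*02",
        "d": "IGHD3-10*01",
        "j": "IGHJ6*03",
        "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
        "annotations": {"tags": ["revcomp", "coding"]},
    },
    {
        "name": "example_read_2",      # hypothetical identifier
        "v": "IGHV3-43*01",
        "d": "IGHD5-24*01",
        "j": "IGHJ4*02",
        "junction_nt": "TGTGCAAAAGAT",  # hypothetical short junction
        "annotations": {"tags": ["coding"]},
    },
]

# Write one JSON document per line, then read the records back.
lines = "\n".join(json.dumps(chain, sort_keys=True) for chain in chains)
restored = [json.loads(line) for line in lines.splitlines()]

print(restored[0]["v"], restored[1]["annotations"]["tags"])
```

Any language with a JSON library can read the same lines, which is what makes the format usable as a lowest-common-denominator message between tools.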
    38. 38. Binary formats • Protobuf, Thrift, or Avro • Flexible data model • All common primitive types (e.g. int, double, string) • Support nested types, including arrays and maps • Efficient binary encoding • Code generation for many languages (binary compatible) • Support for schema evolution • Support IDL for data types and services 38
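To give a feel for what "efficient binary encoding" buys, the sketch below packs a small record with Python's standard `struct` module and compares it with the JSON text for the same values. This is not Protobuf, Thrift, or Avro: it is a hand-written fixed layout standing in for the idea that writer and reader share a schema. The field selection and the `productive` flag are assumptions for the example.

```python
import json
import struct

# Shared "schema": field order and types agreed on by writer and reader.
# Little-endian, three unsigned 16-bit integers followed by one boolean.
FIELDS = ("v_end_idx", "j_start_idx", "num_mutations", "productive")
RECORD = struct.Struct("<HHH?")

record = {
    "v_end_idx": 314,
    "j_start_idx": 389,
    "num_mutations": 43,
    "productive": True,  # hypothetical field
}

# Binary encoding carries only the values; the names live in the schema.
binary = RECORD.pack(*(record[name] for name in FIELDS))

# Text encoding repeats every field name in every record.
text = json.dumps(record).encode("utf-8")

decoded = dict(zip(FIELDS, RECORD.unpack(binary)))
print(len(binary), len(text), decoded == record)
```

The real serialization frameworks add what this sketch lacks: generated code in many languages, nested types, and rules for evolving the schema without breaking old readers.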
    39. 39. Thrift example: Twitter

        service Twitter {
            void ping();
            bool postTweet(1:Tweet tweet);
            TweetSearchResult searchTweets(1:string query);
        }

        struct Tweet {
            1: required i32 userId;
            2: required string userName;
            3: required string text;
            4: optional Location loc;
            16: optional string language = "english"
        }
    40. 40. Thrift example: Immune receptor

        cd ~/repos/kiwi
        thrift --gen java kiwi-format/src/main/resources/thrift/kiwi.thrift
        thrift --gen py:new_style kiwi-format/src/main/resources/thrift/kiwi.thrift

        See: https://github.com/laserson/kiwi
    41. 41. 41 Questions?
    42. 42. 42 Biological parts specifications • Library of parts with well-characterized input-output characteristics • In total, similar to API spec Canton, Nat. Biotech. 26: 787 (2008)
    43. 43. Engineering signaling pathways at inputs/outputs 43 Lim, Nat. Rev. Mol. Cell 11: 393 (2010)
    44. 44. Bottom-up genetic circuit design 44 Brophy, Nature Meth. 11: 508 (2014)
    45. 45. Bottom-up genetic circuit design 45 Brophy, Nature Meth. 11: 508 (2014)
    46. 46. Predict composability of genetic elements 46 Kosuri, PNAS 110: 14024 (2013) • 114 promoters x 111 RBS “…rather than relying on prediction or standardization, we can screen synthetic libraries for desired behavior.”
    47. 47. 47 Most addressable Cheapest to create ZFN => TALEN => CRISPR/Cas Least addressable Most expensive to create
    48. 48. Addressability for precision nanoscale engineering 48 Douglas, NAR 37: 5001(2009)
    49. 49. Addressability for precision nanoscale engineering 49 Douglas, Nature 459: 414 (2009)
    50. 50. Evolution for encapsulation: an evolved electronic thermometer 50 http://www.genetic-programming.com/hc/thermometer.html
    51. 51. Lycopene synthesis optimization 51 Wang, Nature 460: 894 (2009)
    52. 52. Evolutionary encapsulation for signaling pathway engineering 52 Peisajovich, Science 328: 368 (2010)
    53. 53. Evolutionary encapsulation for signaling pathway engineering 53 Peisajovich, Science 328: 368 (2010)
    54. 54. Genetic isolation with Re.coli 54 Lajoie, Science 342: 357 (2013)
    55. 55. So far, we discussed antibody-only data analysis
    56. 56. Antigen-only data generation
    57. 57. Larman, Nat. Biotech. 29: 535 (2011) Ben Larman Steve Elledge Agilent OLS array
    58. 58. 59 Phage immunoprecipitation sequencing (PhIP-seq)
    59. 59. 60 Patient A Replica 1 PatientAReplica2 SAPK4 NOVA1 TGIF2LX log10(-log10 P-value) PhIP-seq proof-of-principle
    60. 60. 61 ‘Forward vaccinology’
    61. 61. 62 ‘Reverse vaccinology’
    62. 62. 63 ‘Immunization without vaccination’
    63. 63. Encapsulation for cancer immunotherapy through TMG processing 64 Tran, Science 344: 641 (2014)
    64. 64. 65 Other examples?
    65. 65. Conclusions • The API perspective helps organize and communicate data • Use sane file formats if possible: • JSON for lightweight work • Thrift/Avro for heavyweight serialization/communication • Decouple data modeling from implementation details • Biological engineering: what abstractions are available? • Evolution as nature’s encapsulator 66
    66. 66. 67
