Helen Berman Data Curation Interview


Published in: Data & Analytics, Technology
Data Curation Insights
Helen Berman, Professor and Director, RCSB Protein Data Bank
BIG: Big Data Public Private Forum

“So what are the changes? A lot more data, higher complexity and there is an acceptance for the methodology that we use.”

Introduction

Helen M. Berman is the director of the RCSB Protein Data Bank, one of the member organizations of the Worldwide Protein Data Bank, and a Board of Governors Professor of Chemistry and Chemical Biology at Rutgers University. A structural biologist, her work includes structural analysis of protein-nucleic acid complexes and the role of water in molecular interactions. She is also the founder and director of the Nucleic Acid Database, and leads the Protein Structure Initiative Structural Genomics Knowledge Base.

Edward Curry: Please tell us a little about yourself: your history in the industry, your role at the organisation, and your past experience.

Helen Berman: I have been involved in one way or another with the establishment and running of the Protein Data Bank for more than 40 years, since its founding. From the beginning, in 1966, we had a firm conviction in the value of such a big data resource, even though we couldn’t prove it yet. In 1971, the PDB was established at Brookhaven. I was spending a lot of time trying to convince people that it was a good idea to collect this data and archive it. And then we fast forward to 1998. In the 90s there were several calls for proposals to manage the PDB, and in 1998 I won the cooperative agreement to run the PDB in America. Once that happened, it became very obvious that there was a worldwide interest in handling the data, not just in America. We formed a consortium with EBI and the group in Osaka, Japan, to form a worldwide PDB, and the purpose of that was to agree on standards for the data as well as methods for processing, curating and distributing the data.
So, we basically formed this organisation ten years ago so that we have worldwide agreement on all of this. The PDB began in 1971 with about seven structures; we have about 95,000 now and will have 100,000 by next year. This is structured data determined by structural biologists using X-ray, NMR, electron microscopy, and other methods as they evolve. Lately, I have mostly been involved with standardisation and curation of the data.

Edward Curry: What is the biggest change going on in your industry at this time? Is Big Data playing a role in it?

Helen Berman: Structure determination has definitely been transformed the most, as traditional X-ray crystallography has been joined by many new developments. This demands new ways of describing and representing the data. The big changes right now in structural biology are that the structures are getting bigger and more complicated and that more methods are being used to solve them. After years of discussion, there is now a tremendous acceptance of our idea that this should all be done by defining our terms. There used to be this idea that if you do that you are going to get in the way of creativity: just collect the data and don’t worry about it. But we always worried about the metadata before other people worried about it. Now the good news is that everybody we are working with has appreciated that, and they really have cooperated in every possible way to allow us to do it. That’s made a huge difference. So what are the changes? A lot more data, higher complexity and there is an acceptance for the methodology that we use.

Edward Curry: How has your job changed in the past 12 months? How do you expect it to change in the next 3–5 years?

Helen Berman: In the last 12 months there has been an enormous growth of the community. More people in the community have now accepted the way in which we have done things.
So we have driven very hard the concept of the data dictionary, and now that's well accepted. The software developers are using this dictionary, so that's a big thing. It took a long time to get that accepted; we started creating these dictionaries about 20 years ago. So that's the biggest change: the acceptance. The way in which my job will change in the next three to five years is that we can do more: we can work on more complicated structures and figure out how to handle them. That part of convincing the community is no longer necessary.

Edward Curry: What does data curation mean to you in your organisation?

Helen Berman: We base all our data curation on our data dictionaries, where every term is defined. Our data dictionary now has 5,000 terms and we keep refining it, so it is very clear, and we expand it as new methods come in. All of the tools that we developed use it and, with the Worldwide PDB, we are building new
software that is completely based on this dictionary. The other issue is that we validate all the data based on community-recognized standards. The way we do that is, we have task forces that consist of experts in data validation; they come up with recommendations on how the data should be validated, and then we implement them. Again, it's all community driven. In the wwPDB, the members review the curation procedures to see whether or not they are giving us the best representation of the data. We are constantly reviewing the entire archive to see whether or not it is consistent and, if we notice inconsistencies, we go back to see whether our validation or curation procedures can be improved so that we can get the data in better shape.

Edward Curry: What data do you curate? What is the size of the dataset? How many users are involved in curating the data?

Helen Berman: We curate three-dimensional coordinates of biological macromolecules whose structures have been determined using established methods. Right now those methods are X-ray, NMR and EM. We look at the coordinates, and we also look at the supporting data underneath them. In the case of X-ray that could be structure factors, for example. We look at the maps from EM and at restraints and chemical shifts from NMR, and we see whether the model matches the data. We have 100,000 structures, about 750,000 files and about 300 GB in storage. There are about 20 annotators worldwide who are working on curating the data. There are probably another ten or twelve people involved in software development. Those are the people who are actually processing the data, and there are thousands of structural biologists who are submitting the data.

Edward Curry: What are the uses of the curated data? What value does it give?

Helen Berman: It is used by researchers in molecular, structural, and computational biology and in pharmacology. The drug industry makes heavy use of it.
It supports teaching biology students. We have more than 300,000,000 downloads of coordinate data in a year. The value of the data is that it helps to give insight into biological mechanisms and function. We look to see who uses the data, and we noticed that mathematicians, statisticians and computer scientists use it to explore new methodologies for their research. It is not biological research, but research which handles complex data, because each data file has about 500 data items associated with it, not counting the coordinates (the metadata is about 500). The dictionary has about 5,000 terms; of those, around 500 are collected. As people become willing to deposit more data, we have all the terms in place, so when they finally decide they want to give us certain kinds of data we can take it. Everything is built on this very structured framework.

Edward Curry: What processes and technologies do you use for data curation? Any comments on the performance or design of the technologies?

Helen Berman: The three centers that I have talked about had two different data processing pipelines. We used the same algorithms to process the data, but different programs. About five years ago, we engaged in a project to create a common tool for deposition and annotation. Now we all use the same everything: a workflow manager, a web interface and a modular system that is completely portable and extensible. Now we have three data centers, and we could have ten data centers with this methodology, or 100. We’ve created a dictionary-driven deposition and annotation system with a workflow manager that has all the rules and all the experience that we’ve gained in the many years we have been doing it, and we have a pipeline for the data processing. Most of it is done computationally, and the role of the annotator or curator is to check at the end of the pipeline whether the final data makes sense.
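The dictionary-driven checking described in this answer can be illustrated with a small sketch. It is not the wwPDB's software: the item names (`method`, `resolution`, `chain_count`), the `TermDefinition` class and the `validate` function are all hypothetical, chosen only to show how a dictionary of defined terms can drive the routine checks so that a curator sees just the flagged problems.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class TermDefinition:
    """One entry of a hypothetical data dictionary (illustrative only)."""
    name: str
    value_type: type
    mandatory: bool = False
    allowed: Optional[FrozenSet[str]] = None  # controlled vocabulary, if any


# Hypothetical three-term dictionary; a real one would hold thousands of terms.
DICTIONARY: Dict[str, TermDefinition] = {
    "method": TermDefinition(
        "method", str, mandatory=True,
        allowed=frozenset({"X-RAY", "NMR", "EM"}),
    ),
    "resolution": TermDefinition("resolution", float),
    "chain_count": TermDefinition("chain_count", int, mandatory=True),
}


def validate(entry: Dict[str, Any],
             dictionary: Dict[str, TermDefinition] = DICTIONARY) -> List[str]:
    """Return a list of problems; an empty list means the routine checks passed."""
    problems: List[str] = []

    # Every mandatory term must be present in the deposited entry.
    for name, term in dictionary.items():
        if term.mandatory and name not in entry:
            problems.append(f"missing mandatory item: {name}")

    # Every deposited item must be defined, correctly typed, and in vocabulary.
    for name, value in entry.items():
        term = dictionary.get(name)
        if term is None:
            problems.append(f"undefined item: {name}")
            continue
        if not isinstance(value, term.value_type):
            problems.append(
                f"wrong type for {name}: expected {term.value_type.__name__}"
            )
            continue
        if term.allowed is not None and value not in term.allowed:
            problems.append(f"value not allowed for {name}: {value!r}")

    return problems
```

Because the rules live in the dictionary rather than in the checking code, adding a new term or a new method in this sketch means adding a dictionary entry, not changing `validate`.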
As time goes on, in the annotation step, we have more educated curators, because the things they have to look at are at a higher level: we let computers handle all of the routine work, and we free the curators to look at the actual meaning of the structure and to see whether or not it makes sense. The efficiency of the whole thing gets way better, so the number of structures per annotator has gone up. We haven’t increased our number of annotators in the last ten years, even though the structures have gone up from 2,000/year to 10,000/year; because we have better tools we can do that. We have created the software, but we have borrowed from all the experts in the field, reusing existing software and expertise.

Edward Curry: What is your understanding of the term "Big Data"?

Helen Berman: From my point of view, Big Data as it is defined now includes datasets that are very large, complex and extremely noisy, where the signal is very low, and the job of the Big Data experts is to figure out ways to extract information from this very noisy data. Regarding the X-ray crystallography pipeline, suppose all the data comes out of the synchrotrons, this massive data, and we had no definition of terms, no methodologies, no algorithms developed, and then you throw it at a computer scientist and tell him, you figure out what this means. Then you would have to develop all kinds of methodologies to figure out the signal, which is the structure in our case.

In the history of structural biology and X-ray crystallography, a few hundred years ago, people saw crystals and asked: why are they shining? Why do they have straight edges? They then developed the whole thing called crystal systems based on just looking at these crystals, and it turns out that everything they developed by just looking at the morphology of the crystal is absolutely correct.
Then as time went on, they developed something called the International Tables for Crystallography, which organises the details of what are called space groups, which define these crystals, and there are 230 of them. So we have these tables on how crystals can be organized, and no one has ever demonstrated anything other than what these tables have shown. In the 20th century, people decided they wanted to look at the structures of proteins, so they started doing X-ray diffraction, and over time they developed methodologies for taking diffraction patterns from crystals. The diffraction patterns come from X-rays, and people developed algorithms which return the structure from this data, which is very big. Over time, they figured out how to refine those structures. Before the PDB ever existed, the methods of going from the raw data to the crystal structures were well determined. Then the PDB came along and, because of people’s backgrounds, there were a lot of crystallographers. Then the same thing happened: let’s define every term, let’s figure out exactly how we collect the data and get the metadata exactly right.

“So basically from my point of view, Big Data as it is being defined now are datasets that are very large, complex and extremely noisy where the signal is very low and the job of the Big Data experts is to figure out ways to extract information from this very noisy data.”

Over time, this process became more sophisticated. What we have now in the current PDB is extraordinarily well-curated data, because you have all those very compulsive people who have spent their entire careers figuring out how you get the structure and how you curate the data. So our signal is very high, and now we are at the point of fussing about the signal. Do we have the standard deviation right for all the atoms? How do we describe the error properly? So when I try to understand why we are not Big Data, what I see is that we are not Big Data because we’ve had a whole group of people focusing all their attention on defining the terms, in ways that I’m not aware any other field has done. The issue is the science, the technology and the community. The community has a mindset that says we must define our terms, we must be precise. Everybody who contributes to the PDB understands that. Compare that with some newer fields where there are huge amounts of data: if you ask somebody in such a field how they got to a given data point, how many communities will be able to tell you? Not very many. That is the issue that I see. I actually think that there is huge value in studying that. How do you get a community to cooperate? Because we are told we are the gold standard for databases in biology. It is really built on the backs of people who had this idea that you should define everything. What happened is that one of the people I talked to said: “the problem with you guys is that you solved the problem, so you are not Big Data”. And then I say: maybe somebody should pay attention to what we did as a community, and take the lessons from that community and apply them. There is the technology that was applied in this process, and there is the sociological perspective on how to build this community consensus. How do you get a community to actually cooperate? That wasn’t easy.
There were times when people were fighting and screaming about all sorts of things, but in the end I think we succeeded in doing something from which another community could take the lesson and apply it, collapsing it down into a year or two instead of 40 years.

Edward Curry: What influence do you think "Big Data" will have on the future of data curation? What are the technological demands of curation in the "Big Data" context?

Helen Berman: What we focused on was figuring out how to develop ontologies and data models to describe the data. We spent a lot of time on that. It is a very laborious process. The other way is using natural language processing (NLP) and similar methods, like pattern recognition and similarity functions, to try to pull data out. I think that what really has to happen is to marry those two things. Don’t just do NLP, but do come up with a way to marry the sort of laborious dictionary and ontology development with the NLP, and cross those two things; from my point of view that is what has to happen. I hope I am wrong, but I don’t think that you can just throw modern computer science methodologies at this data without somehow capturing the controlled vocabularies and ontologies. I see those two communities not working together enough. It’s a big challenge. If you are going to solve Big Data problems you have to match the domain expertise with the computer science expertise. It would be important to bring a sociologist into the analysis of the problem as well. A sociologist can take a look at some successful and some unsuccessful cases.

In terms of curation challenges, for example, we’ve got the structure of a large HIV capsid, a huge structure with 1,300 chains in it, and we had to figure out how to represent it in a mechanism that wasn’t used to handling this. Because we had been thinking about this when we created our dictionary, we set it to have no limit on atoms or chains. In the end we did it.
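The idea of marrying dictionary and ontology development with text-based extraction, raised earlier in this answer, can be sketched in a few lines. This is only a toy illustration under stated assumptions: the vocabulary, its synonym lists and the `extract_terms` function are hypothetical, and plain pattern matching stands in for a real NLP method. The point it shows is that the curated vocabulary decides which canonical terms the text extraction is allowed to return.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical controlled vocabulary: canonical term -> free-text synonyms.
VOCABULARY: Dict[str, List[str]] = {
    "X-RAY": ["x-ray crystallography", "x-ray diffraction"],
    "NMR": ["nmr", "nuclear magnetic resonance"],
    "EM": ["electron microscopy", "cryo-em", "em"],
}


def build_matcher(
    vocabulary: Dict[str, List[str]],
) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one pattern that recognises every synonym in the vocabulary."""
    lookup = {
        synonym.lower(): canonical
        for canonical, synonyms in vocabulary.items()
        for synonym in synonyms
    }
    # Longest synonyms first, so multi-word phrases win over short ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    body = "|".join(re.escape(alt) for alt in alternatives)
    # The lookarounds stop matches inside longer words such as "them".
    pattern = re.compile(rf"(?<![\w-])({body})(?![\w-])", re.IGNORECASE)
    return pattern, lookup


def extract_terms(
    text: str, vocabulary: Dict[str, List[str]] = VOCABULARY
) -> List[str]:
    """Return canonical vocabulary terms mentioned in free text, in order."""
    pattern, lookup = build_matcher(vocabulary)
    found: List[str] = []
    for match in pattern.finditer(text):
        canonical = lookup[match.group(1).lower()]
        if canonical not in found:
            found.append(canonical)
    return found
```

In this sketch the domain experts maintain `VOCABULARY` and the extraction code never invents a term of its own, which is one simple way the two kinds of expertise could be combined.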
We are going to have more structures that are determined by five different methods, and we are going to have to put it all together and figure out what the error limits are. A structure that is very small and has a large data/parameter ratio will be a lot easier to represent accurately than the huge macromolecular machines that we are going to need to be able to handle. That’s where our biggest challenge is, and that’s why it’s fun. That’s modern science.

Edward Curry: What data curation technologies will cope with "Big Data"?

Helen Berman: As I said before, we need to work out how to bridge the two communities better (natural language processing and ontology engineering). We also need to be able to handle large volumes of data and process them. Deciding how much data and which data we need to collect is very important. I think you need to think ahead every time: what am I going to get if I collect this data? We don’t do formal crowdsourcing. You want to make sure that what you are getting is the best possible curation. The crowdsourcing we do is through these task forces: whenever we have a problem to solve, we bring in experts, sit down and talk for a couple of days, and say, here are the questions we have, how do we deal with them? So we bring in people in that way, but we don’t actually bring in people to process the data. Even with a very well trained annotator with good tools, it requires a lot of surveillance.

About the BIG Project

The BIG project aims to create a collaborative platform to address the challenges and discuss the opportunities offered by technologies that provide treatment of large volumes of data (Big Data) and its impact in the new economy. BIG’s contributions will be crucial for industry, the scientific community, policy makers and the general public, since the management of large amounts of data plays an increasingly important role in society and in the current economy.
“Don’t just do natural language processing (NLP), but do come up with a way to marry the sort of laborious dictionary and ontology development, with the NLP.”
