Increasing the Semantic Capabilities of Your ELN to Capture More Complete Data
Jeremy Frey, Professor of Physical Chemistry at the University of Southampton, UK, joins John Trigg, Director of phaseFour Informatics, to discuss increasing the semantic capabilities of your ELN to capture more complete data. To listen to the full interview, go to "Do You Know How to Integrate the Semantic Capabilities of Your ELN?"
J Trigg: Just to start, I’d like you to briefly introduce yourself and maybe explain your interests in ELNs.
J Frey: Yes, my background is as a physical chemist. My research involves the use of lasers to investigate a wide variety of chemical problems, and as such I work with many of my chemistry colleagues from other parts of the discipline, with physicists, and increasingly with computer scientists. My interest in the whole problem of ELNs came from our desire to be able to reproduce our research without having to redo it, and from realising that in many cases the records we were keeping of our research were not as complete as we would like, and certainly not as accessible or shareable as we would want. I joined the UK e-Science research programme that started about ten years ago, where the whole desire was to use modern technology, the web, semantics and many other aspects of digital technology, to improve the reliability and the reusability of research and research data. And for chemists, the starting point of this is often the laboratory notebook, so this becomes an electronic laboratory notebook, and I've been interested to see just what we could do with that, what the future will bring and what advantages it will bring.
J Trigg: The first question really touches on the word, semantics, which you used in your introduction there. I guess the term has crept into our vocabulary over the past few years but I’m particularly interested to know how you see the role of semantics in the context of electronic laboratory notebooks and laboratory information management.
J Frey: Perhaps in some ways semantics has just become technical jargon for what we all know we ought to have been doing all along, and that is keeping a proper record in context, and it's this context that is important. Sometimes that gets referred to as metadata, and if that metadata is properly structured in a way that can be readily understood, particularly by an automatic computational process rather than just by humans, it comes to be called semantics. At one level, what we really should be talking about is keeping a proper record with all this information in there.
I often quote, when we're teaching, that if you ask a student to keep a record of something in the laboratory, the first thing they'll write down is something like 25, meaning the temperature of the room, but they will not necessarily include the units, and they won't include a description of what the temperature is or what was being measured. As they learn to make better and better records, that context becomes more and more complete. And if you define some guidelines for that context, so we get the meanings of words in there (things like the temperature, what scale it refers to and what you are expecting), then we start to get more and more formal semantics. In the limit this can lead to a whole community-agreed ontology, which means that those concepts are referenceable back to something defined, more than just a dictionary, and that means that automatic processes can go looking for specific things and, to some extent, "understand" the information that is there, because the context is so much more complete.
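The progression Frey describes, from a bare number to a fully contextualised measurement, can be sketched as a simple data structure. This is a hypothetical illustration only; the field names are invented, not drawn from any standard vocabulary:

```python
# A bare reading, as a student might first record it: ambiguous on its own.
bare_reading = 25

# The same reading with its semantic context made explicit.
# Field names here are illustrative, not a standard vocabulary.
measurement = {
    "value": 25,
    "units": "degC",                      # scale: Celsius, not Fahrenheit or Kelvin
    "property": "temperature",            # what was measured
    "subject": "laboratory room",         # what it was measured of
    "timestamp": "2010-06-15T09:30:00Z",  # when it was recorded
}

def describe(m):
    """Render a measurement so both humans and software can interpret it."""
    return f"{m['property']} of {m['subject']}: {m['value']} {m['units']}"

print(describe(measurement))  # temperature of laboratory room: 25 degC
```

A program receiving only `bare_reading` can do nothing reliable with it; one receiving `measurement` can check units, filter by property, or trace when the value was taken, which is the "understanding" the extra context enables.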
J Trigg: That’s interesting because quite clearly it makes a lot of common sense but what are the implications when we start to look at semantics in terms of, say, whether we have a common laboratory language? Do we all mean the same thing when we use the same words?
J Frey: I think it's a really serious problem, because we often don't mean the same thing, and a word itself has to be in a context: in an organic chemistry laboratory a word may have one context, and in an analytical laboratory it may have a different context. That different context may mean it has a different meaning. And certainly for the traceability issues that an analytical chemist will wish to deal with, a much more extensive chain of provenance is required than might be needed in other types of research laboratories. So we don't have as common a language as we like to think, and most of us cope with this because we understand the terms from our experience as chemists; we can switch context, and we're very good at working out what something ought to mean, what it must mean, as we follow through a description. But the problem is that if you're now trying to search other people's records and experiments automatically, it's much more difficult to create that search, and we run into the whole problem of natural language searching, which is a really difficult problem for computer scientists across the whole field of artificial intelligence. If we can agree on the vocabulary of the context, then we cut through some of that and can actually seriously understand what we were trying to say, and be very explicit. But it is quite tedious: if you have ever been involved in user requirements capture, with a set of users talking to a set of computer engineers who have got to build the software, the detail you have to go through to get the users to think about what they really mean can be excruciating at points. But once it's done and agreed, it opens up an awful lot of interaction, and interdisciplinary interaction as well, which is very valuable.
J Trigg: I think that the laboratory environment probably does not have a particularly good track record in standardising around a lot of aspects of the data and information that it uses, so this would indeed present a very serious challenge. Are there any other particular challenges in trying to adopt a semantic approach to laboratory information management?
J Frey: I think the problem really just follows on from what we were talking about. There is the scale of the different vocabularies in use, the different pieces of equipment and the different types of information to be exchanged. So even if you start working up a good semantic description, there is so much to do that it makes it very difficult to get agreement. We have entrenched interests from different suppliers, and I don't just mean the equipment suppliers; I mean people in research groups who like to do things their way. But there is also the whole issue that, in some sense, there's an assumption behind all we're doing that people wish to exchange and communicate information, and I think there are quite a number of social barriers to that. It depends where you are in the industrial and academic research chain, whether it's to your advantage to exchange information or whether you want to keep it quiet. In a lot of cases it's necessary to exchange it amongst a local team, and even that can be quite tricky; in other cases it's necessary to pass information up and ultimately on to various authorities. So there are a lot of different demands on you as to the form the information should take and whether it should be private or secret (in fact, secret is overstating the case), but these sociological issues are just as important as the technical ones.
J Trigg: I was tempted to ask which of those two elements you saw as a major hurdle.
J Frey: I don't know. I think we can get some glimpses of this by looking at the adoption of data sharing in other disciplines. If we take the bioinformatics and biological communities, obviously on the biology side there was a lot more work on the systematic naming of things, which perhaps inspired that community to think in this way, and perhaps because of the large governmental funding into the bioinformatics area there has been a lot more open data and, it seems, a lot more willingness to share data and methodology, and that has been much more widely adopted. We can take the physics community, which adopted the arXiv pre-print service very rapidly and showed real interest in sharing that information. The astronomy community had agreements of the type that data collected on telescopes is available to the original user for about 18 months and thereafter is available for research, which leads to things like virtual telescopes being created.
The chemistry community has been somewhat more concerned about sharing data, and some of this is perhaps understandable. There are commercial interests: we all think we may be able to make some money out of our discoveries, and we may be able to make more out of them if we keep very tight control of our data. So we can see that in some communities the social structure was a bit different, and we have seen more openness and sharing. Whether that data is shared in such a way that it's easy to reuse, i.e. whether the semantics are still in there, is still an issue, but at least some of the social constraints were less than in chemistry. In chemistry I think we have a very traditional community, and people are somewhat wary of change and sharing.
J Trigg: Do you think that in order to make any progress here, this would need to be a community effort across the industry rather than trying to solve a problem within a particular organisation?
J Frey: I think the times when we could afford to solve it in an individual organisation have gone. The cost of setting up this type of infrastructure, just in the sheer time needed to agree on things, makes this prohibitive. If we can agree on all the terminology and the semantics and structure and so on, then writing the software on top of that is still an immense task, but you don't want to build it on shifting sand, so you really need this agreement. And I think we would all benefit by having that basis sorted out. It's a sort of pre-competitive requirement, and I think that really is going to be the only way to reach agreement. Perhaps this is where some of the international bodies should be taking a greater lead, with the understanding that this is not a pure research effort: there are academic and commercial interests here that need to come together, and it has got to satisfy all of them. One would hope that by adopting these things it will also reduce the cost of innovating new equipment, and you won't have to spend so much time writing quite so much software because some of the standards will be out there; you don't have to invent it all yourself. And we know that, in the Tom Lehrer vocabulary, plagiarism is much better than reinventing it all yourself.
J Trigg: If you see it that way that it’s an industry issue, is the implication then that we need to tackle the semantics element initially to develop this common language before we could actually start to work on other issues in the industry, for example data interchange standards?
J Frey: I think if one has a good understanding of the semantics, and the technology to support that sort of semantic level has now matured to the point where it can support the type of discussions needed, then it will be much easier to write the data interchange standards, because the standard won't be of the form "column one of this file should be this and column two, that". You will be able to say things like "column one is the temperature of a reactor", and so on. You will be able to describe these things in such a way that the next level of software can find what it needs, rather than being very tied down to a format. So you will describe the data and the information it contains rather than the detailed formats, and I think that will make it much easier for people to write this and exchange the information. And when you do exchange that information, it will be a lot easier to follow a provenance trail backwards to find the true origins of the data, and I think increasingly people are going to be very aware of that need for regulatory and other requirements. One of the things that best summarises some of this is the whole Climategate issue that the UEA got involved in. The real issue was summarised by the BBC website, which said "show your workings", which is what we were all told at school. But what you need is an ability to show your workings easily, and that's where the provenance chain has got to be available automatically. That really implies a proper set of semantics, and it will make the whole issue of data interchange radically easier once we've agreed on that.
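Frey's contrast between positional formats and described data can be sketched with a small example. The column names and units below are invented for illustration; the point is that a self-describing file lets software locate values by meaning rather than by position:

```python
import csv
import io

# Positional interchange: meaning lives outside the file, in documentation.
# Without that documentation, the columns are unreadable.
positional = "25.0,1.2\n26.1,1.3\n"

# Self-describing interchange: each value carries its meaning with it.
# The header names and units shown are invented for illustration.
described = (
    "reactor_temperature_degC,flow_rate_L_per_min\n"
    "25.0,1.2\n"
    "26.1,1.3\n"
)

# Software can now find what it needs by name rather than by column index.
rows = list(csv.DictReader(io.StringIO(described)))
temps = [float(r["reactor_temperature_degC"]) for r in rows]
print(temps)  # [25.0, 26.1]
```

A fuller semantic approach would go further still, tying each header to an agreed vocabulary or ontology term, but even named, unit-bearing columns remove much of the ambiguity of a purely positional format.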
J Trigg: That’s some really good background but I believe that you have actually been putting some practical work together in pursuing this particular line. Is that something you are able to talk about now?
J Frey: Yes. We've taken two approaches. We started some work at the beginning of our time in the e-Science programme where we tried to build a very detailed semantic description of a synthetic process. We did this originally with an analogy, so that we could get the computer scientists and the chemists talking together somewhere safe: we were making cups of tea. It's very interesting how many details get left out; there are a number of people whose recipe, if you followed it literally, would never have you take the teabag out of the cup. And then there is the whole issue, of course, of the order of adding tea and milk and things like that. So we had to develop a system to describe that, and we were using the language of the semantic web, the Resource Description Framework, RDF. When we started, the language and our understanding of it were very incomplete and we came across a number of problems, many of which we think we can now solve. So we have a way of describing processes, perhaps in the same way as the chemical engineers would already describe processes, but formalised in terms of the semantic web language. We found that quite interesting: we could build software on top of it to both plan and guide the research and allow people to make notes and change things as they went along. By change, of course, we mean recording what you actually observed against the plan. Just as a side note, there's an interesting issue in this country, the UK, with the COSHH legislation, which means that we have to have very detailed plans in advance of doing any experimental work, and we are leveraging those plans to provide the descriptions. That work is ongoing and has produced some very useful ideas about how to describe experiments.
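RDF represents knowledge as subject-predicate-object triples, and the tea-making analogy can be sketched in that spirit with plain Python tuples. This is not the group's actual RDF model; the mini-vocabulary (`action`, `precedes`) and step names are invented here to show how an explicit process description catches omissions a prose recipe hides:

```python
# A process described as subject-predicate-object triples, in the spirit of
# RDF. The terms form an invented mini-vocabulary, not a published ontology.
triples = [
    ("step1", "action",   "add teabag to cup"),
    ("step1", "precedes", "step2"),
    ("step2", "action",   "add boiling water"),
    ("step2", "precedes", "step3"),
    ("step3", "action",   "remove teabag"),   # the step prose recipes often omit
    ("step3", "precedes", "step4"),
    ("step4", "action",   "add milk"),
]

def ordered_actions(triples):
    """Follow the 'precedes' chain to recover the planned order of actions."""
    precedes = {s: o for s, p, o in triples if p == "precedes"}
    actions = {s: o for s, p, o in triples if p == "action"}
    # The starting step is the one that no other step precedes.
    start = (set(precedes) - set(precedes.values())).pop()
    order, step = [], start
    while step:
        order.append(actions[step])
        step = precedes.get(step)
    return order

print(ordered_actions(triples))
```

Because the ordering is stated explicitly rather than implied by prose, a program can check the plan (is the teabag ever removed? does milk come before or after the tea?) and later record the observed steps against it, which is exactly the plan-versus-record use Frey describes.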
But we also came to the conclusion that sometimes that level would be too detailed, and of course for early adoption, when you don't have an agreed vocabulary for your work, that could be an issue. In many cases, the way to introduce this to new users in new communities was to adopt a rather more softly-softly approach, following what Jim Hendler has described as "a little semantics goes a long way", just to show how powerful even a little extra metadata and extra agreement can be. We built that into our own blog engine to act as a laboratory blog, and that work has now matured into some software called LabTrove, where the researchers effectively blog their work but the discussion is all around the attached data, and we can add a lot of interesting metadata to make it much easier to find and reuse the data that's in there. We've applied that to a number of research groups in Southampton and several other universities across the world to gain insight into how this blogging methodology, coupled with some semantics, can improve research. Certainly we've observed that it changes considerably the nature of the interaction between research students and supervisors, and the ways you can run discussions about the research that's going on. It's been a very productive and interesting experience, and most people using it have found it quite intuitive and a very exciting way forward.