Introduction to SMART Genomics

The SMART Genomics API is a means of classifying and packaging genomic information for use in the clinical realm. With the influx of data supporting personalized genetic medicine, a need has arisen to make this information usable in the electronic medical record by point-of-care providers. Since a clinical API has already been defined and supported by the SMART clinical initiative, it is natural to model the use of genomic data on this established method.

This project introduces a standard for access to genomic data, the SMART (Substitutable Medical Applications, Reusable Technologies) Genomics API, which is being developed at the Harvard Medical School's Center for Biomedical Informatics as part of the United States Office of the National Coordinator's Strategic Health IT Advanced Research Projects program. The standard defines the means of retrieving data from various sources and makes it easy for a developer to integrate data from multiple sources into an application. By adopting the standard, a data source gives developers access to its data and helps the end-user make use of all available information.

The advent of genomics has led to huge amounts of biological data. Each type of data has its own set of properties describing an experiment's conditions and results, and these characteristics vary widely. Experiments are diverse, yielding different types of data depending on the technology from which they derive. Data size, file type, discreteness, and interrelationships all shape the information produced by a specific type of experiment.

Data from biological laboratories differs greatly from that seen in medical records. Medical records hold relatively little data in comparison. Medical information is most often discrete, so it can be split into smaller parts for both transfer and storage. It is also simple, often taking the form of low-tech text and images. Some types of biological data differ in each of these respects: they are large, come in distinct and varied formats, and are complex and cannot be split. Biological data therefore proves hard to manage.

In trying to manage this unwieldy data, biological data sources have proposed and employed several methods of storage, retrieval, and processing. In the scramble to accommodate dissemination and use of their information, each source often produced its own methods for managing the information along with its own set of standards. Advances in technology continually impose new requirements on biological data. As a result, data from different sources is difficult to combine and compare.

Electronic medical records do not have these problems, because the data evolves slowly and varies little. This makes standards for medical information easier to define and maintain than those for biology. Standards involving medical records are highly structured and rigid. Those used in clinical settings, such as HL7 or LOINC, define guidelines for packaging and disseminating data, and go further by characterizing how the data is to look. HL7 provides a definition for the types of data found in electronic medical records; the format and structure of the data are its foundation. Similarly, LOINC provides a method of coding lab measurements and results and defines specific rules for formatting the data and for what types of data are allowed.

Dissemination and use of biological data must be elastic. Not only is the information a constantly changing work in progress, it also does not lend itself to rigid standards. Genomic data must be packaged and distributed in a way not previously seen in the field of information management. Standards that define these tasks need to accommodate the inherent differences between biological data and its medical counterpart. In addition, the ways in which we handle data require a less constraining set of rules. This way standards can be agile and allow for new types of data that were unforeseeable when the standards were defined. A newly invented technology will inevitably produce different data with different formats and requirements, and the existing standards need to accommodate this.

Each source of data maintains its own system for how the data is formatted, transmitted and, ultimately, used. Under the standards defined by SMART, packaging and retrieval of biological data are made uniform across all data sources, regardless of the information they maintain and the technology from which it derives. The set of standards provided by SMART defines an application programming interface (API) that lays out the means of accessing each data source's data.

There are three entities involved in the SMART model: data sources, developers, and end-users. A data source is the provider of the data. Any organization producing biological data, whether a company, public database portal, or academic lab, can be a data source. The role of the data source is to create a "binding" that associates its data with the components found in the API. A developer produces the applications that facilitate the use of the data. The developer is responsible for creating the workflow for how the data will be used and what the end-user can do with it. An end-user is the one who uses the application to get meaning from the data.

To retrieve data, developers use a single method to pull it from the data source. Instead of having to learn how to connect and negotiate with each source's server, the developer uses the data source's binding. A binding is a layer of abstraction that exposes only the means of getting the data and nothing more. It masks the features specific to each data source's organization along with the nuances of the different types of data. Regardless of the type of data or where it comes from, the developer only needs to know this single function. The binding is produced by the data source and is typically a piece of software programmed to access that source's specific data, regardless of the infrastructure or where the data resides. The SMART standard would incorporate each binding into the API.
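The idea of a binding can be illustrated with a minimal Python sketch. It is not taken from the SMART Genomics API itself: the `Binding` class, its `fetch` method, the two example data sources, and the sample records are all hypothetical names and made-up values chosen for illustration. The sketch only shows how two sources with different internal storage could sit behind the same single function.

```python
from abc import ABC, abstractmethod


class Binding(ABC):
    """Hypothetical binding interface: the only thing a developer sees."""

    @abstractmethod
    def fetch(self, identifier: str) -> dict:
        """Return the record for `identifier` as a plain dictionary."""


class InMemoryVariantBinding(Binding):
    """Illustrative data source that keeps its records in a dictionary."""

    def __init__(self, records: dict):
        self._records = records

    def fetch(self, identifier: str) -> dict:
        # Return a copy so callers cannot alter the source's own data.
        return dict(self._records[identifier])


class DelimitedTextBinding(Binding):
    """Illustrative data source that keeps its records as tab-delimited text."""

    def __init__(self, text: str):
        self._lines = text.strip().splitlines()

    def fetch(self, identifier: str) -> dict:
        header = self._lines[0].split("\t")
        for line in self._lines[1:]:
            fields = line.split("\t")
            if fields[0] == identifier:
                return dict(zip(header, fields))
        raise KeyError(identifier)


# Made-up sample data; the developer's code is identical for both sources.
in_memory = InMemoryVariantBinding({"variant-1": {"id": "variant-1", "gene": "GENE_A"}})
delimited = DelimitedTextBinding("id\tgene\nvariant-1\tGENE_A\n")

for binding in (in_memory, delimited):
    print(binding.fetch("variant-1"))
```

The developer's loop never touches the dictionary or the delimited text directly, which is the kind of masking the binding is meant to provide.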

As more data sources adopt the SMART standard, a rich array of data becomes available. This allows the developer to bring together data from multiple data sources in a single application with ease. An application could be as simple as performing a single task, such as converting from one file format to another. An application may also be complex, integrating several kinds of data from multiple data sources. The goal of complex applications is to provide many sources of data on which the end-user can base their analysis. A combination of different data types provides multiple lines of evidence on which to base conclusions.
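As a rough sketch of such integration, the Python fragment below collects records about one item from several sources through the same kind of fetch function. The function `gather_evidence`, the source names, and the sample records are assumptions made for illustration and are not part of the SMART standard.

```python
from typing import Callable, Dict

# A fetch function takes an identifier and returns a record (hypothetical shape).
Fetch = Callable[[str], dict]


def gather_evidence(sources: Dict[str, Fetch], identifier: str) -> Dict[str, dict]:
    """Collect the record for `identifier` from every source that has one."""
    evidence = {}
    for name, fetch in sources.items():
        try:
            evidence[name] = fetch(identifier)
        except KeyError:
            # This source has nothing for the identifier; skip it.
            continue
    return evidence


# Made-up records standing in for two independent data sources.
expression_data = {"GENE_A": {"expression_level": "high"}}
variant_data = {"GENE_A": {"variant_count": 3}, "GENE_B": {"variant_count": 1}}

sources = {
    "expression_source": lambda key: expression_data[key],
    "variant_source": lambda key: variant_data[key],
}

evidence_a = gather_evidence(sources, "GENE_A")
evidence_b = gather_evidence(sources, "GENE_B")
print(evidence_a)
print(evidence_b)
```

Here `evidence_a` holds two lines of evidence for the same item, while `evidence_b` holds only the one source that has a record for it.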

There are many benefits to adopting the SMART API. For the data source, the benefit is exposure: users are better able to access the information and are therefore more inclined to use that source's data. For the developer, the advantage is efficiency, since data integration in an application becomes easier. The end-user gains better access to the information, making for better analysis and, ultimately, better science.

Participating in the SMART standard involves three steps. First, define what information should be accessible to the end-user. Next, determine how the data will be packaged and disseminated. Then produce the binding by programming the method that fetches the data. The bioinformatics group at the data source's organization could assist with the process.
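From the data source's side, the three steps might look like the following Python sketch. Everything in it is an assumption made for illustration: the field names, the internal record, the choice of JSON as the packaging format, and the functions `package` and `fetch` are not prescribed by the SMART standard.

```python
import json

# Step 1: decide which fields the end-user may see (hypothetical field names).
EXPOSED_FIELDS = ("id", "gene", "consequence")

# Internal storage of the data source; a made-up record for illustration.
_INTERNAL_RECORDS = {
    "variant-1": {
        "id": "variant-1",
        "gene": "GENE_A",
        "consequence": "missense",
        "internal_batch": "b42",  # Not exposed to the end-user.
    },
}


def package(record: dict) -> str:
    """Step 2: package a record, here as JSON holding only the exposed fields."""
    exposed = {field: record[field] for field in EXPOSED_FIELDS}
    return json.dumps(exposed, sort_keys=True)


def fetch(identifier: str) -> str:
    """Step 3: the binding's single entry point for retrieving data."""
    return package(_INTERNAL_RECORDS[identifier])


packaged = fetch("variant-1")
print(packaged)
```

Keeping the list of exposed fields separate from the fetch function lets the data source change what it shares without changing how developers call the binding.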