Publications by Type: Journal Article

Goodman AA, Pepe A, Blocker A, Borgman CL, Cranmer K, Crosas M, Stefano RD, Gil Y, Groth P, Hedstrom M, et al. Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Computational Biology [Internet]. 2014;10(4):e1003542.
Pepe A, Goodman A, Muench A, Crosas M, Erdmann C. How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers. PLoS ONE [Internet]. 2014;9(8):e104798.
We analyze data sharing practices of astronomers over the past fifteen years. An analysis of URL links embedded in papers published by the American Astronomical Society reveals that the total number of links included in the literature rose dramatically from 1997 until 2005, when it leveled off at around 1500 per year. The analysis also shows that the availability of linked material decays with time: in 2011, 44% of links published a decade earlier, in 2001, were broken. A rough analysis of link types reveals that links to data hosted on astronomers' personal websites become unreachable much faster than links to datasets on curated institutional sites. To gauge astronomers' current data sharing practices and preferences further, we performed in-depth interviews with 12 scientists and online surveys with 173 scientists, all at a large astrophysical research institute in the United States: the Harvard-Smithsonian Center for Astrophysics, in Cambridge, MA. Both the in-depth interviews and the online survey indicate that, in principle, there is no philosophical objection to data sharing among astronomers at this institution. Key reasons that more data are not presently shared more efficiently in astronomy include: the difficulty of sharing large data sets; over-reliance on non-robust, non-reproducible mechanisms for sharing data (e.g., emailing it); unfamiliarity with options that make data sharing easier (faster) and/or more robust; and, lastly, a sense that other researchers would not want the data to be shared. We conclude with a short discussion of a new effort to implement an easy-to-use, robust system for data sharing in astronomy, and we analyze the uptake of that system to date.
Beaumont CN, Goodman AA, Kendrew S, Williams JP, Simpson R. The Milky Way Project: Leveraging Citizen Science and Machine Learning to Detect Interstellar Bubbles. The Astrophysical Journal Supplement Series [Internet]. 2014;214:3.
We present Brut, an algorithm to identify bubbles in infrared images of the Galactic midplane. Brut is based on the Random Forest algorithm, and uses bubbles identified by >35,000 citizen scientists from the Milky Way Project to discover the identifying characteristics of bubbles in images from the Spitzer Space Telescope. We demonstrate that Brut's ability to identify bubbles is comparable to that of expert astronomers. We use Brut to re-assess the bubbles in the Milky Way Project catalog, and find that 10%-30% of the objects in this catalog are non-bubble interlopers. Relative to these interlopers, high-reliability bubbles are more confined to the midplane, and display a stronger excess of young stellar objects along and within bubble rims. Furthermore, Brut is able to discover bubbles missed by previous searches, particularly bubbles near bright sources which have low contrast relative to their surroundings. Brut demonstrates the synergies that exist between citizen scientists, professional scientists, and machine learning techniques. In cases where "untrained" citizens can identify patterns that machines cannot detect without training, machine learning algorithms like Brut can use the output of citizen science projects as input training sets, offering tremendous opportunities to speed the pace of scientific discovery. A hybrid model of machine learning combined with crowdsourced training data from citizen scientists can not only classify large quantities of data, but also address the weaknesses of each approach if deployed alone.
Beaumont CN, Offner SSR, Shetty R, Glover SCO, Goodman AA. Quantifying Observational Projection Effects Using Molecular Cloud Simulations. The Astrophysical Journal [Internet]. 2013;777:173.
The physical properties of molecular clouds are often measured using spectral-line observations, which provide the only probes of the clouds' velocity structure. It is hard, though, to assess whether and to what extent intensity features in position-position-velocity (PPV) space correspond to "real" density structures in position-position-position (PPP) space. In this paper, we create synthetic spectral-line maps of simulated molecular clouds, and present a new technique for measuring the reality of individual PPV structures. Using a dendrogram algorithm, we identify hierarchical structures in both PPP and PPV space. Our procedure projects density structures identified in PPP space into corresponding intensity structures in PPV space and then measures the geometric overlap of the projected structures with structures identified from the synthetic observation. The fractional overlap between a PPP and PPV structure quantifies how well the synthetic observation recovers information about the three-dimensional structure. Applying this machinery to a set of synthetic observations of CO isotopes, we measure how well spectral-line measurements recover mass, size, velocity dispersion, and virial parameter for a simulated star-forming region. By disabling various steps of our analysis, we investigate how much opacity, chemistry, and gravity affect measurements of physical properties extracted from PPV cubes. For the simulations used here, which offer a decent, but not perfect, match to the properties of a star-forming region like Perseus, our results suggest that superposition induces a ~40% uncertainty in masses, sizes, and velocity dispersions derived from ¹³CO (J = 1-0). As would be expected, superposition and confusion are worst in regions where the filling factor of emitting material is large. The virial parameter is most affected by superposition, such that estimates of the virial parameter derived from PPV and PPP information typically disagree by a factor of ~2. This uncertainty makes it particularly difficult to judge whether gravitational or kinetic energy dominates a given region, since the majority of virial parameter measurements fall within a factor of two of the equipartition level α ~ 2.
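The overlap measurement described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each structure is a set of (x, y, v) voxel indices in a PPV cube, and it assumes "fractional overlap" means intersection over union, which the paper may define differently. The function name `fractional_overlap` and the toy voxel sets are hypothetical.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]  # (x, y, v) index into a PPV cube


def fractional_overlap(projected: Set[Voxel], observed: Set[Voxel]) -> float:
    """Intersection over union of two voxel sets (an assumed definition)."""
    union = projected | observed
    if not union:
        return 0.0  # two empty structures share nothing
    return len(projected & observed) / len(union)


# Hypothetical toy structures: a PPP structure projected into PPV space,
# and a structure identified directly in the synthetic observation.
projected = {(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)}
observed = {(1, 0, 0), (1, 1, 0), (2, 0, 0), (2, 1, 0)}

overlap = fractional_overlap(projected, observed)
print(f"fractional overlap = {overlap:.3f}")
```

Under this assumed definition, a value near 1 would mean the observed structure traces the projected density structure closely, and a value near 0 would mean the two barely coincide.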
Goodman AA. Principles of High-Dimensional Data Visualization in Astronomy. Astronomische Nachrichten [Internet]. 2012;333(5-6):505-514. Astrobites commentary on this article.
sets, though, interactive exploratory data visualization can give far more insight than an approach where data processing and statistical analysis are followed, rather than accompanied, by visualization. This paper attempts to chart a course toward “linked view” systems, where multiple views of high-dimensional data sets update live as a researcher selects, highlights, or otherwise manipulates one of several open views. For example, imagine a researcher looking at a 3D volume visualization of simulated or observed data, and simultaneously viewing statistical displays of the data set’s properties (such as an x-y plot of temperature vs. velocity, or a histogram of vorticities). Then, imagine that when the researcher selects an interesting group of points in any one of these displays, the same points become a highlighted subset in all other open displays. Selections can be graphical or algorithmic, and they can be combined and saved. For tabular (ASCII) data, this kind of analysis has long been possible, even though it has been under-used in Astronomy. The bigger issue for Astronomy and several other “high-dimensional” fields is the need for systems that allow full integration of images and data cubes within a linked-view environment. The paper concludes its history and analysis of the present situation with suggestions that look toward cooperatively-developed open-source modular software as a way to create an evolving, flexible, high-dimensional, linked-view visualization environment useful in astrophysical research.
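The "linked view" idea in this abstract can be sketched as a shared selection that every open view listens to. The following is a minimal illustration only, not code from the paper or from any particular visualization package; the class names `LinkedSelection` and `View`, and the toy temperature and velocity columns, are hypothetical.

```python
from typing import Callable, Iterable, List, Set


class LinkedSelection:
    """Shared selection state; every registered view is told when it changes."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[Set[int]], None]] = []

    def register(self, listener: Callable[[Set[int]], None]) -> None:
        self._listeners.append(listener)

    def select(self, row_ids: Iterable[int]) -> None:
        selected = set(row_ids)
        for listener in self._listeners:
            listener(set(selected))  # each view receives its own copy


class View:
    """Stand-in for one display (scatter plot, histogram, 3D volume)."""

    def __init__(self, name: str, selection: LinkedSelection) -> None:
        self.name = name
        self.highlighted: Set[int] = set()
        selection.register(self.on_selection)

    def on_selection(self, row_ids: Set[int]) -> None:
        self.highlighted = row_ids  # a real view would redraw here


# Hypothetical tabular data: one row per point.
temperature = [10.0, 12.5, 30.0, 45.0, 11.0]
velocity = [1.2, 0.8, 5.5, 6.1, 0.9]

selection = LinkedSelection()
scatter = View("temperature vs. velocity", selection)
histogram = View("histogram", selection)
volume = View("3D volume", selection)

# Two algorithmic selections, combined by set intersection.
hot = {i for i, t in enumerate(temperature) if t > 20.0}
fast = {i for i, v in enumerate(velocity) if v > 5.0}
selection.select(hot & fast)

for view in (scatter, histogram, volume):
    print(view.name, sorted(view.highlighted))
```

Keeping the selection as a set of row indices, separate from any one display, is what lets selections be combined (here with `&`) and saved independently of the views that show them.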
Beaumont CN, Williams JP, Goodman AA. Classifying Structures in the Interstellar Medium with Support Vector Machines: The G16.05-0.57 Supernova Remnant. The Astrophysical Journal [Internet]. 2011;741:14.
We apply Support Vector Machines (SVMs), a machine learning algorithm, to the task of classifying structures in the interstellar medium (ISM). As a case study, we present a position-position-velocity (PPV) data cube of ¹²CO J = 3-2 emission toward G16.05-0.57, a supernova remnant that lies behind the M17 molecular cloud. Despite the fact that these two objects partially overlap in PPV space, the two structures can easily be distinguished by eye based on their distinct morphologies. The SVM algorithm is able to infer these morphological distinctions, and associate individual pixels with each object with >90% accuracy. This case study suggests that similar techniques may be applicable to classifying other structures in the ISM, a task that has thus far proven difficult to automate.
Pepe A, Mayernik MS, Borgman CL, Sompel HVD. From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web. Journal of the American Society for Information Science and Technology [Internet]. 2010;61.
In the process of scientific research, many information objects are generated, all of which may remain valuable indefinitely. However, artifacts such as instrument data and associated calibration information may have little value in isolation; their meaning is derived from their relationships to each other. Individual artifacts are best represented as components of a life cycle that is specific to a scientific research domain or project. Current cataloging practices do not describe objects at a sufficient level of granularity nor do they offer the globally persistent identifiers necessary to discover and manage scholarly products with World Wide Web standards. The Open Archives Initiative's Object Reuse and Exchange data model (OAI-ORE) meets these requirements. We demonstrate a conceptual implementation of OAI-ORE to represent the scientific life cycles of embedded networked sensor applications in seismology and environmental sciences. By establishing relationships between publications, data, and contextual research information, we illustrate how to obtain a richer and more realistic view of scientific practices. That view can facilitate new forms of scientific research and learning. Our analysis is framed by studies of scientific practices in a large, multi-disciplinary, multi-university science and engineering research center, the Center for Embedded Networked Sensing (CENS).