Visualizing the legacies of Einstein's general theory of relativity through scientific research one hundred years after its discovery

Creative Direction
Jer Thorp

Data Research + Graphic Design
Genevieve Hoffman

Data Scraping + Concept Modeling
Ellery Royston

Software Prototyping for Print Graphic
Noa Younse

Front-End Web Development
Genevieve Hoffman

My role: data sourcing and research, visual design, front-end development

2015 marked the 100-year anniversary of the General Theory of Relativity, which Albert Einstein completed with the publication of the Field Equations of Gravitation paper in November 1915. Scientific American approached The Office for Creative Research (where I worked from January 2015 until June 2017, when it ended operations) to create a visualization exploring the impact of the general theory. After researching the available data, we decided to create a visualization that takes a snapshot of scientific research a century after the general theory was proposed, showing which areas of physics research derived from the general theory are most active a hundred years later.

We created a three-dimensional landscape of scientific papers tagged with the General Relativity — Quantum Cosmology category on arXiv, a database of scientific papers that have been published or are under peer review. Using a list of about sixty keywords of subjects that relate to general relativity, we analyzed the papers and grouped them according to which keywords they contained. More popular keywords have more papers near them, which push upward to form “peaks” around the most popular keywords, like black holes and spacetime. The print visualization, as well as an interactive version, are on Scientific American’s website.

Much of the following is also documented in a process article I wrote for Medium while at the OCR.

First attempts and dead-ends

Princeton University had recently digitized all the papers and correspondence that Albert Einstein wrote from his youth until 1923, which can be browsed in their digital collection. Although Einstein had been developing the general theory of relativity for about a decade, it wasn’t until the Field Equations of Gravitation paper, published in November 1915, that he finalized the mathematical models necessary to support his theory. Our initial concept was to use this paper as the starting point of a citation trail mapping the reach of the general relativity paper over the past 100 years.

We looked at a number of open and closed scientific databases, such as Scopus, Google Scholar and Web of Science, to see how we could go about building a citation trail. However, while some databases were more complete than others, very few contained citation data predating 1970. This is largely because citations in scientific papers were not as standardized in the early twentieth century as they are now, and because scientific databases are more likely to ingest new papers and citation data than historical material.

An early concept presented to Scientific American for the visualization. Graphic by Jer Thorp

There has to be a better way

Since there were too many gaps to chart the impact of general relativity over time through citations, we decided to use the wealth of current scientific papers to take a snapshot of research actively engaging with principles derived from the theory of general relativity.

We decided to use arXiv, Cornell University Library’s open database, which some consider the most current repository of research since scientists can upload papers for peer review before they’ve been published. arXiv also categorizes papers by area of science and, within that, by area of physics. One of arXiv’s physics subcategories is General Relativity — Quantum Cosmology (gr-qc). For our dataset, we selected papers with gr-qc as their primary category, since we could be sure they related to general relativity. We looked at all papers in the gr-qc category that were uploaded to arXiv in 2014, giving us the most recent complete year of papers.
An alternative approach looking at physics disciplines as linked to one another on Wikipedia

The ins and outs of an API

While arXiv is great in that it’s an open database (not behind a paywall) and has an API, the API is a bit confusing to use and not fully documented. For instance, we were interested in accessing references and citations for papers, which can be done on the website, but there was no documented API call for doing so (we were left to our own devices to get this data…). In addition, we couldn’t query by year, but had to use the somewhat clunky “start” and “max_results” parameters with the sortBy parameter set to “submittedDate” to get the year we were interested in.
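This paging workaround can be sketched in Python. The query parameters (search_query, start, max_results, sortBy, sortOrder) are the arXiv API’s real ones; the function name and page sizes are illustrative, not the actual code we ran:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_query_url(category, start, max_results):
    """Build an arXiv API URL for one 'page' of results.

    The API has no year filter, so we page backwards through
    submission dates and filter for the target year locally.
    """
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

# Page through gr-qc in chunks of 100 results
urls = [build_query_url("gr-qc", s, 100) for s in range(0, 300, 100)]
```

Each URL is then fetched and parsed as Atom XML, keeping only entries whose submission date falls in the target year.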

While we did collect citation data from arXiv, there weren’t many citation linkages, since all the papers we examined are from 2014 and, being so recent, rarely cite one another. We needed to generate our own metrics to determine the most popular research topics related to general relativity. For each paper in the General Relativity — Quantum Cosmology category we collected the following information:

        List of Authors
        Primary Category
        Subcategories (if any)
        References (if any)
        Citations (if any)
        Published / Not yet published

Example of a record returned by the API
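The fields above map naturally onto a per-paper record. A minimal Python sketch (field names are ours, not the API’s response schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PaperRecord:
    """One gr-qc paper, with the fields we collected for it."""
    authors: List[str]
    primary_category: str                      # e.g. "gr-qc"
    subcategories: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)   # ids of cited papers
    citations: List[str] = field(default_factory=list)    # ids of citing papers
    published: bool = False                    # vs. still under peer review
```

A collaboration paper would then look like `PaperRecord(authors=[...850+ names...], primary_category="gr-qc", published=True)`.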

If at first we don’t succeed, let artificial intelligence take the lead

Since we couldn’t rely on citation data to create links between papers, we looked to the text of the abstracts to see if we could group papers by common research area. We turned to the Alchemy API to process the abstracts of all 2,435 papers added to arXiv in 2014 and tagged with General Relativity — Quantum Cosmology (gr-qc) as their primary category. The Alchemy API, which is now part of IBM’s Watson platform, lets users leverage machine learning capabilities for image and text processing. We were specifically interested in the AlchemyLanguage Concept Tagging API, which we used to analyze the paper abstracts.

When we ran the corpus of abstracts through the Alchemy API, it returned a list of concepts, along with a relevance score for each detected concept on a scale of 0.0–1.0, 1.0 being the most relevant. Alchemy returned over 1,500 concepts after analyzing the 2,435 abstracts. We totaled the scores for each concept to determine the “most popular concepts” across all the abstracts. This list was heavily edited to cull redundant words and topics, and was finally reduced to 61 concept terms that the editors at SciAm deemed relevant to physics and general relativity.
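The score totaling is a simple aggregation. A minimal Python sketch of the idea, with hypothetical data (the concept names and scores are made up for illustration):

```python
from collections import defaultdict

def rank_concepts(tagged_abstracts):
    """Sum per-abstract relevance scores (0.0-1.0) for each concept.

    tagged_abstracts: one list of (concept, score) pairs per abstract,
    as returned by a concept tagger.
    """
    totals = defaultdict(float)
    for tags in tagged_abstracts:
        for concept, score in tags:
            totals[concept] += score
    # Highest total score first = "most popular" concepts
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_concepts([
    [("Black hole", 0.95), ("Spacetime", 0.60)],
    [("Black hole", 0.80), ("Quantum gravity", 0.70)],
])
```

Summing (rather than counting) rewards concepts that are both frequent and central to the abstracts they appear in.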

Laying out the network — making connections and forming relationships

First, we wanted to see how the concepts themselves related to one another. Using a network graph layout, we created links between concept terms that were found in the same papers. For each pair of concept terms, we counted what percentage of the total papers they shared, as well as their combined Alchemy relevance score. Using the toxiclibs Processing library, we generated a basic layout model for the concepts. The higher the percentage of papers two terms share, the stronger the link between them in the network graph. Terms with more connections were more fixed (and more popular), and provided the central nodes that the other terms organized around.
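The link weighting can be sketched as follows. This is a simplified Python version of the pairwise counting (the original ran in Processing, and we also folded in the combined relevance scores, omitted here):

```python
from itertools import combinations

def link_weights(papers_by_concept, total_papers):
    """Weight each concept-pair link by the share of all papers
    mentioning both terms.

    papers_by_concept: dict mapping a concept to the set of
    paper ids tagged with it.
    """
    weights = {}
    for a, b in combinations(sorted(papers_by_concept), 2):
        shared = papers_by_concept[a] & papers_by_concept[b]
        if shared:
            weights[(a, b)] = len(shared) / total_papers
    return weights

w = link_weights({
    "Black hole": {1, 2, 3},
    "Spacetime": {2, 3, 4},
    "Inflation": {5},
}, total_papers=5)
```

Pairs with no shared papers simply get no link, which keeps the network graph sparse.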

Early prototype of the base network of 61 concepts, brighter nodes are concepts that appear in more papers. Made by Noa Younse
After laying out the 61 concept terms in the network diagram, we needed to arrange the papers around the terms they referenced. In a particle simulation, terms acted as gravitational attractors for the papers. Daniel Shiffman’s Box2D Processing library was used to prevent articles from overlapping while they were being pulled toward their preferred locations. Articles that grouped around more popular terms were pushed upward, creating peaks around terms like ‘Black holes’ and ‘Quantum gravity.’ Although Alchemy had returned over 1,500 concepts across all the papers, a few papers did not contain any of the 61 chosen concept terms. For those papers, we used a combination of shared references, citations, and keyword matching against the abstract text to place them near the papers most relevant to them.

Using a particle simulation with forces like gravity at the various term nodes to lay out the articles. Noa Younse created a custom tool in Processing which I used to lay out the articles
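The core of the attractor simulation is a relaxation step that pulls each paper a fraction of the way toward its term node. A minimal Python sketch of that one idea (the real tool ran in Processing with Box2D handling collisions; names and the strength value are illustrative):

```python
def attract_step(paper_pos, target_pos, strength=0.1):
    """One relaxation step: move a paper a fraction of the way
    toward its attracting term's position."""
    x, y = paper_pos
    tx, ty = target_pos
    return (x + (tx - x) * strength, y + (ty - y) * strength)

# Repeated steps converge the paper onto its term node
pos = (0.0, 0.0)
for _ in range(50):
    pos = attract_step(pos, (10.0, 5.0))
```

With collision bodies added, papers that cannot all fit around a popular term get pushed upward instead, which is what forms the peaks.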

While the visualizations for the print version of Scientific American are more legible as top and side views, we created them with a 3D layout built in Processing.

Once we had a layout we found pleasing, we saved those positions and output the articles as vector graphics from a top and side view. Ultimately, we felt these offered a clearer understanding of the relationships across articles in a print medium.
I spent a lot of time finding the right view of the articles in their final positions
An early draft of the visualization layout. We ended up taking out the bar charts visualizing the overall breakdown of papers across categories on arXiv and replacing them with a key for reading the visualizations
The final visualization in Scientific American
An early export for the article intro image 
The final article intro page

Moving from a 2D to a 3D world

For the interactive component of the graphic, we decided to let users explore the 3D environment, allowing them to zoom in and out on areas that interested them. In order to draw in 3D on the web we turned to three.js, the wonderful WebGL library created by mrdoob. We saved out the positions of the terms and papers from the print version generated in Processing, and used these values to draw the shapes in three.js.

A screen capture of the interactive version of the graphic that I developed for Scientific American’s website

Some notes on our approach

Since the graphic was published, we’ve seen questions pop up on Twitter about our process for laying out the papers. Someone suggested we might have used the t-SNE method, which projects multi-dimensional data into 2D space, to lay out the papers. We didn’t use a multi-dimensional approach, but relied on the Alchemy API and network layout models to group the papers after analyzing their abstracts.

Our visualization approach was not as straightforward as what you would get from a layout function in Gephi; it was more an interpretive analysis that allowed us to spatialize the dataset around concepts we curated for relevance. The Alchemy API allowed us to generate numerical relationships between the papers themselves where no citation link existed. As we laid out the articles into a sort of generative terrain, we realized that we needed to give more “gravity” to unpopular terms, so that the large groupings forming around popular terms would disperse into more discrete clusters.

And some lessons learned

One aspect we found interesting is the set of articles with more than 850 authors, which are highlighted in red in the visualization. Articles with this many authors all relate to gravitational wave detection with the LIGO detector and are authored by the LIGO Scientific Collaboration. It was also surprising to realize that one paper, submitted in March of 2014, had already been cited 85 times a year later. This paper focused on advances in tests of the general theory, which are still being conducted today, and it made us realize how much of science is a constant process of experimentation and re-evaluation, even for ideas that have been given the status of “theory.”

This project involved a fair amount of trial and error. Sometimes there is a promising germ of an idea, but it’s hard to predict what the data will actually yield. The kind of visualization that is possible to create depends on the data that can be found and understood. As is often the case, datasets are not always complete or readily accessible. Working with a historian of science to map citations in early physics papers would have been great, but we needed to change direction once we realized that wasn’t feasible within our timeframe.

As designers, there are many ways we can tell a story, and determining the best way forward is an iterative process of research, sketching and prototyping. Sometimes a visualization utilizes “neutral” scientific data; other times, as here, it depicts a collaboration between natural language processing algorithms and scientific expertise and curation. In any case, it’s not quite a surprise that Einstein’s ideas are still relevant today, but it is amazing to realize how many avenues of research in the quest to better understand our universe they’ve made possible.
Copyright © 2024, Genevieve Hoffman. All Rights Reserved.