Starting at the end
This post looks at how I produced a spatial mapping, based on publication subject areas, of Warwick University Mathematics Institute (Coventry, UK). At the time of writing, I have been a Visiting Researcher at the Warwick maths department for a few years.
I used nonlinear dimensionality machine learning techniques to convert published paper topic codes (MSC codes, explained later) into spatial relationships.
The mapping is based on publicly available (well, paywalled in any case) data from the American Mathematics Association.
Eagle-eyed observers will note that I am not present. Since I am not faculty, I didn't put myself on the map.
Read the article to find out how this works.
No time for side projects
Way before this post was put together, I was in at The Alan Turing Institute. A normal day, I had my typical research meetings. My work at the time (and for some time before that) centred around machine learning for engineering applications. Less social and more mechanical. At some point in the meeting, my colleague (a Warwick maths faculty member) had an idea.
What if we took the MSC classification codes and built a map of who publishes what in the maths department?
The Mathematics Subject Classification (MSC) codes are available on MathSciNet, managed by the American Mathematical Society (AMS). When publishing a math paper, researchers can assign these MSC codes to their work. This helps categorise papers by topic. The list of codes as of 2020 is here.
A cool idea!
I had plenty of experience with machine learning techniques that would be needed to produce a good map. Nonlinear dimensionality reduction methods would be required.
We discussed it for a little while, but it wasn't really critical. It wouldn't have a big impact on me and would be at best a side-project. Plus, what if we upset people? Who would end up where on this map? Who who be in the 'center' and who would be on the 'periphery' (even if these concepts didn't mean anything to the algorithms)? Academia can be political and I didn't want any noses to be put out of joint over a distraction.
Well, then came the kicker. The MathSciNet terms of service. You are not allowed to scrape the database. Getting the MSC code data would be too annoying to make this project worthwhile.
Access to the database is paid, like journal articles. If you have an academic affiliation, the university (usually via the library) provides access to the paywalled academic literature. So, you aren't really paying personally for journals (or are you, publishing is big, profitable business for everyone but academics).
I do not want to lump MathSciNet in with the big journal publishers, they have operating costs, so fair enough that they need some income.
In any case, since an academic login is required to access the MathSciNet database to get the publication codes, it's not like you can just unethically violate the terms of service and anonymously get away with it, even if you wanted to.
No data unless it is manually collected. That means taking down the publication records for around 90 people. Some of these people have triple digit numbers of papers. I didn't have the time or the inclination to put that together for a side project.
This is a little annoying, the data is sort of open source, but sort of not. You can kind of use it, but kind of not. All that data sitting there, all dressed up but nowhere to go. Maybe it was for the best?
Dead before we began.
Back from the beyond
I had an online meeting (since corona had now kicked in) with the colleague who had proposed the original idea. He was excited. He had the data. Perhaps just the list of names? No. All the MSC code data, obtained the hard way and all above board.
We could proceed.
I still had my misgivings, some sense of unease. People are becoming increasingly worried about the effect of the 'algorithmisation' of our lives. Do categories and taxonomies just drive wedges between us? I have always resisted being classified and labelled. Should I now do that to others?
Well, he had already started. A valiant first effort, but much more needed to be done. I decided I had better get with the program since this was going ahead with or without me.
He is a mathematician, not a data scientist. Super experienced in theory, but less experienced in the use of these algorithms. I can only imagine, but I assume that he found out quickly that there is a huge degree of subjectivity in data science. There is art mixed in with the science. It is the art that really takes time to learn.
Like any sort of data modelling, from engineering to social networks, this subjectivity is like a live wire or a tank of fuel. It is a source of power and danger. I managed to set aside whatever was gnawing at me. If I didn't use my skills in this area, maybe whatever problems I could sense would be worse than if I did nothing?
Sign me up.
How to produce a pretty map?
I took over the project. Data in hand (or on hard drive at least), I thought about how this could work. With the data, I started the map again from the beginning. A fresh start.
I would need to collate the MSC code data. This data doesn't just come as one number. Each individual on my list has one or more papers. Each paper has a primary MSC code, and (possibly) several MSC codes. I could write a Python script to put together all this data.
The Python script would then take the data and run some statistical algorithms. No big deal, this looked like a job for scikit-learn, a Python machine learning library.
I would run the data through
scikit-learn and go from there. I knew from experience that getting usable plots always takes some finesse, but I was ready. I didn't know what algorithm would produce the best output (for example t-SNE or PCA) so I would have to experiment.
The dimensionality reduction algorithm could produce a series of points over a two dimensional plane. One point per researcher. The hope was that the proximity of the points should reflect who has, historically, published work in similar areas. If everything went well, nearby points would be researchers with similar interests. Points far apart would be researchers with less related publications. Fingers crossed.
Note that the algorithm would have no understanding of the codes themselves. To the machine, these codes are just numbers. I had to hope that the map would come out ok.
I decided to add two extra parts to the plan.
Extension 1: Colouring the map regions
The first extra part of the plan would be somewhat more subjective. Dimensionality reduction algorithms on point data (like I had) produces a set of points in space. Maps like this (composed only of points) are a little hard to read.
I would use the known data points on the map (one for each researcher), as well as the MSC codes at each point to generate an 'influence map'.
Sort of like in the example image of a Gaussian Mixture model shown above, I could generate a map showing the influence of different researchers on different topics spatially.
What if each researcher generated a colour region around their assigned point, with a colour dependent on their MSC codes? Then, if the similar researchers were placed close together by the first step of the algorithm, the colouring algorithm should produce 'regions', like little nations on the map.
In a sense, the researchers could act like little kernel basis centres on the map. The colours, basis functions and influence strength would have to be tuned by hand to produce something visually appealing.
Extension 2: Correlation cluster analysis
The second extra part of the plan would be easier. Dimensionality reduction, necessarily, obscures certain aspects of the data. I am a big fan of cluster analysis and correlation plots, so I would use seaborn to do some clustermap analysis.
I prefer this sort of analysis compared to dimensionality reduction for data like this. The map wasn't my idea, but I thought I would extend the project to include some of my preferences.
Correlation matrix visualisation can be harder to read (in some ways) but is a more accurate representation of the data. Since producing this analysis should be easy, I would put together a correlation map.
Actually producing the map
Counting up the MSC codes and building a distribution
The first step was to collate the data. Each researcher on MathSciNet has a reference number, the MR number. I used this as the unique reference for each of my data points.
What data to collect? The codes for each researcher would need to be compressed into some simple vector representation for further analysis.
The way to really think about data like this is in terms of conditional probabilities. I wanted to know That is, the probability for a given collection of codes given a researcher.
From there, I could work out correlations between researchers and so on. Remember that each paper has a primary MSC code and possibly several secondary codes. Unfortunately, there is no consistency about how many secondary codes a paper should have or how important a secondary code is compared to a primary code.
I decided to go with the simplest approach I could think of. If it wasn't good enough I could come back and try something more statistically elaborate. For each researcher, I simply built a count of how many times they used each code. Primary and secondary codes were considered independent and stored independently.
The original data could be thought of something like 'codes per paper per researcher'. The effect of finding a distribution over codes was to compress this information into 'codes per researcher'.
Already we are compressing some information away, hopefully the map would not be compromised.
I also saved the parsed data to disk. If I had to rerun everything, I didn't want to have to do all this processing again.
Generating a correlation matrix for the clustermap
To make sure that the data would be suitable for mapping, I needed to run the cluster analysis. To build the clustermap, I needed to overcome the next challenge. I needed a correlation matrix. That means I needed to generate some sort of feature vector describing .
Again, I picked something simple. The primary and secondary codes were assumed to be independent. The feature vector for each researcher was simply the concatention of the primary and secondary vectors.
This assumed that the primary and secondary codes are independent. This is almost certainly not accurate, but given only ~90 researchers I would not be able to infer the true joint distribution accurately anyway.
Further, I was working with around 100 MSC codes. Some papers have upwards of five secondary codes. A full joint distribution over the primary and secondary codes would mean several tensor products, resulting in some totally enormous feature vector. Hopefully, my independence assumption would be enough.
Was it still going to be ok to have a map which was making claims about who was 'near' who in the department based on this potentially compromised distribution?
I set about arranging the data from each researcher into a set of feature vectors. Each feature vector was of the form where refers to the frequency of the primary MSC code with number . Similarly refers to the frequency of the secondary MSC code with number .
The vector was effectively a histogram of 'code-counts-per-researcher', normalised by the total number of codes.
For each researcher, the vectors were arranged into a matrix, . This meant that the vectors became the columns of the matrix (so had one column per researcher).
The correlation matrix was computed by where is a normalisation factor.
I didn't really have the proper joint distribution over the MSC codes, but I had a correlation matrix computed using a simple method that might just work.
Seaborn makes cluster analysis very easy. The basic command to run the analysis is
cg = sns.clustermap(correlationMatrix)
Note that the code I really used looks like
cg = sns.clustermap(correlationMatrix, cmap ="YlGnBu", xticklabels=displayNames, yticklabels=displayNames)
where the display names were computed based on each researcher's surname and their most frequently used primary MSC code.
I ran the code and hoped that my simplified, compressed feature vectors would be ok.
It was good! There were clearly defined regions. The people who should be near each other were near each other. By this time, working on the problem, I had perhaps forgotten my misgivings about who 'should' be near who, but at least numerically things were sound.
Now I was confident. It could be done. The map existed, somewhere in the data. I would just have to find it.
The map - dimensionality reduction
I ran t-SNE over the matrix of features (discussed previously) using
tSNEMethod = manifold.TSNE(n_components=2, init='pca',random_state=seed) X = mcsFullFeatureVector #the F matrix Y = tSNEMethod.fit_transform(X)
init='pca'argument tells t-SNE to use a PCA initialisation phase.
- Since t-SNE is a local optimiser, it uses randomisation to achieve outputs. The
seedargument controls the random number generator.
n_components=2argument means I want two output features. I took these two components to be the x and y positions of each feature on my 2D map of the department.
The results looked good. Everything was going great, so I decided to move on to the influence mapping. It was the only way to tell if the t-SNE map really was ok. If it wasn't, I would have to return to this step. I knew that the random seed would have a big influence on the output, but I needed more information.
This was easy, so I was confident.
The influence map
Was confident. Until I got to this part. That shifted to frustrated. Suddenly I was back in the wilderness. No more nice Python packages to download and run. I would have to do some real work.
To output the influence map, I generated a lattice over the region around the data points on the 2D map plane. I figured that the primary codes were the main cause of the clusters, so the influence map was based on the primary MSC codes.
To generate a lattice data structure with a resolution of 80 by 80 I ran
MAXIMUM_MSC_CODES = 100 [ ... ] numCellsX = 80 numCellsY = 80 primaryCodeInfluenceMap = np.zeros((numCellsX,numCellsY,MAXIMUM_MSC_CODES))
At each location in the lattice, I could compute the total influence for each MSC code. This influence could be computed for each researcher based on the distance of their mapped point to the lattice centre.
So, mathematically, the primary code influence map is of the form where is the fractional influence of primary code at location (where is a lattice cell location). Each vector is normalised and represents the influence of each primary code at each location .
Well, that is the idea. I had to include an additional code, 101, meaning 'no influence here' so that no vectors would be the zero vector after normalisation.
This was pretty data inefficient, but good enough for a quick project!
For each person, I computed an exponentially decaying influence, , over the map based on the Euclidean distance between the researcher's map location, and the influence map lattice cell, . That is where:
- is the influence weight. I had to tune this depending on the t-SNE seed.
- is the Euclidean distance between the cell location and the fitted researcher location .
gives the influence of the primary code of person at location . To compute the actual influence of the primary code at , I used linear interpolation: where is a one-hot vector, with a value equal to one at the researcher 's primary code, and 0 everywhere else.
Note that the usual issues relating to the numerical stability of normalisation needed to be applied.
At this point I was still frustrated, but at least I was starting to get somewhere.
Colouring the influence map
Now, it was time to actually colour the influence map. I could assign a colour to each cell. I would just assign a colour to each MSC code, then alpha blend the colours at each lattice cell by setting the alpha of each colour to the value of for that MSC code.
The simplest sort of alpha blending method is linear. If you place a filter of one colour () over another (), you can linearly interpolate them as follows where is the alpha colour channel of .
In graphics and shader programming, things get a bit more complicated (you usually use premultipled alphas) but whatever, linear interpolation would be fine here.
Cool, sounds good. This shouldn't be a problem.
Wait. Oh no. What colour? Every t-SNE seed would generate a different map. That means that different clusters of researchers could end up next to each other in any order. Since each cluster could move around (being placed in essentially a random location on the final map by t-SNE), pre-colouring the clusters wouldn't work.
If you read up on map colouring, like in the four colour theorem, it is not a pleasant problem. Well, I suppose that depends on how you feel about it, but in my case I knew that I was in for some trouble.
Moreover, sure you can theoretically generate a zillion colours, but how do you ensure that you will produce a pretty output with nice colours next to each other? After blending them? More 'oh nos'.
I fell back on what is basically my favourite algorithm ever. Monte Carlo methods (although really just random sampling here). Throw enough mud and hope some sticks. I could randomly generate colour assignments until I got something that looked pretty close to what I wanted.
After that, a few manual tweaks could give me what I needed for each map.
I hoped that this would work. I couldn't feasibly come up with a colour map that would always look ok. The amount of work was starting to pile up. I saw some major problems coming my way if I couldn't get the colouring to work easily. Colouring is always subjective, so maybe this would never look good.
Worried, I rolled the dice.
The computer ticked over; the influence map and colouring takes some time to compute. But there it was.
That's it. A few colour tweaks here and there and it's over.
One map of many. An infinite variety could be generated. I liked this one. If you compare the map and the correlation matrix clusters, you will see that the clusters on the colourful map match up well.
At first I didn't think this was such a good idea. As I mentioned, I don't like to be categorised, it lets someone put you in a little box and leave you there. People are more complicated than that.
I pressed through this project without thinking too much about what it means to run analytics on another person. You are just too detached. I think some of my concerns at the start of this project were well founded, but that shouldn't stop us from understanding what data means and how we can use it.
What this project did highlight is that analysis is subjective. Compare the first map, at the start of this article, and the last. Who is placed where, what is coloured in what way. They aren't exactly the same, but they do rhyme. The results are interpretable, not absolute. Like everything.
Looking at the maps, I had an insight. This isn't just a taxonomy, categorisation or separation of people. Even without data analytics, people can act as if we are all divided. What these maps show to me is that despite living in our little clusters, we are part of a greater whole.
If you liked this article, please get in contact! You can send an email or chat in real time on discord.