Thanks to advances in machine-learning-based image recognition and AI-driven optimization, the influence of big data on computer science receives a great deal of attention. Today, however, we will explore its effect on one of the most exciting fields of science: astroparticle physics, which has produced 5 Nobel Prize winners in the last 20 years, due in no small part, according to Dr. Bradley Kavanagh, to developments in big data.
Bradley Kavanagh is an astroparticle physicist working at the University of Amsterdam. He is a theorist who currently specializes in dark matter research. More information about Dr. Kavanagh is available on his website, www.bradkav.net, and his GitHub, github.com/bradkav.
I interviewed Dr. Kavanagh to get his thoughts on how the move towards larger and larger scales of data collection and processing has affected his field of study. Given the tendency in astroparticle physics to go to extreme lengths to get measurements, I assumed that big data would have a profound effect on the capabilities of astroparticle physicists. Dr. Kavanagh elaborates below on just how intertwined his field is with computerized data collection.
Astronomy has always been data-driven, but I asked Dr. Kavanagh how he thinks the move towards larger and larger amounts of data processing has affected how many astrophysicists work in theory versus how many work on measurement-tool development and data analysis.
[I actually don’t think that’s the way to think about it. From my perspective, the role of what you would have thought of 50 years ago as an observer, someone looking at images [from telescopes], has become more technical. As you go to a big, all-sky survey, the experimentalist’s job is not just to take images but to write code that deals with all the data from those images and turns it into something useful.]
Right, so you think it has just become more software focused, but isn’t really pulling theorists into experimentally-based stuff?
[Yeah, it’s not necessarily that experimentalists take a bunch of images and you need theorists to deal with them. The job of the experimentalist has transformed into being someone who runs experiments but also deals with the data that comes out of them.]
Do you think that’s made the job of the experimentalist harder or easier? In terms of how much time they may spend on a particular paper or development?
[That’s harder to say. I would think it has made the job harder, but the rewards have become greater. There is a lot of detailed data analysis that needs to happen, but some of that involves automating tasks that would have taken ages 10 years ago, so now you can achieve a whole lot more.]
Being able to quickly send copious amounts of raw data to other research facilities around the world has obviously increased the extent to which astrophysicists collaborate, but I asked Dr. Kavanagh what data-driven change he thinks is the next best step for improving international collaboration.
[One of the issues is that if you just give someone a whole bunch of data, you would think it would give them the opportunity to use that data, but actually there is a lot of infrastructure that goes with it (like how to use the data). For example, the Large Hadron Collider has been producing huge, huge amounts of data, and they physically can’t release all of the data that they have, because there is just too much of it. They are the only people who have enough space to store it all, so it just doesn’t make sense to share it. But they have been trying to be open with intermediate-level data. So what [they] do is take all the raw data that they have and make what they think is the most useful intermediate step [in data processing]. It’s some sort of synthesis, with which they can also provide some sort of tools or explanations [for use].
The Fermi satellite has been doing something similar, slowly releasing data sets and tools so that people can actually use what [the satellite] produces.
The data alone is not enough; it’s about putting it into a format that people can use and providing the tools for people to use it. That’s what actually connects people and lets them collaborate on these sorts of things.]
So do you think some sort of data distribution standards between projects would be a good idea?
[It’s hard to make a data standard that works for everybody. Sometimes I try to reproduce the results that [other researchers in my field] have published, and to do that you don’t need all the data that they have; you need some slightly curated set of it, a useful summary. For a while there were a few of us trying to come up with a format for what that summary needs to be: for example, the size of your detector, how long it ran for, what the response was under certain conditions, etc. What we found very quickly was that making each different experiment fit with that specific data set was just going to be too hard.
Instead, what you’re best off doing is getting everyone to be as open as possible with their data, but also providing as much metadata as possible so that it is reusable.]
Dr. Kavanagh has released quite a bit of code jointly with his academic publications, so I asked him whether he thinks physicists writing their own software to go along with their research is a consequence of the move towards big data in their field, or whether this would have happened anyway out of sheer interest.
[In my case it started happening because people were trying to reproduce the results of experiments. We came across this problem where I would ask [other researchers at a different university] how they had done it, and they would give me a rough idea, but it was really impossible for us to compare. I decided that the only way to deal with this was to release my code as a way of establishing a standard.
Since more and more of what we do is data analysis and coding, if you don’t release the pipeline for doing that, then it is hard for people to scrutinise, because that pipeline has become part of the research that you’ve done; you have essentially not published your method.
So I think that, probably, in the long run, the more data-science-ey part of science is driving people to write their own code and make it public because not everything can be contained in just the paper anymore.]
We can see from Dr. Kavanagh’s publications that he is focused on the theoretical side of astroparticle physics, but I asked him how essential he thinks collecting large amounts of data is to the progression of the field. In other words, as of right now, is the field more measurement-driven or more theory-driven?
[At the minute it’s definitely data driven. In the past, particle and astroparticle physics have gone through phases of having a bunch of data which appears to be connected together in strange ways and no one understands how or why it is connected. And that’s paradise if you’re on the very theoretical end of the spectrum because you can take that [data] and look at it and try and make sense of it.
Right now, it is not that we have no data, but all the data we have agrees with our baseline understanding of things. So people are looking for the exciting things: the edge cases, the extremes and the anomalies, to try and explain things. So I definitely think we’re in kind of a data-starved situation, where a lot of what [researchers] do involves theorising about what you could do with the next up-and-coming experiment.
For example, in my field of dark matter research, we don’t have any new and interesting data, so people are starting to shift to other fields where more data is available.]
The theory of dark matter is attractive because it provides very good explanations for phenomena such as the way structure grows in the universe, or the way mass orbits within its galaxy. In fact, the theory fits the data so well that many physicists take dark matter’s existence for granted. But there has yet to be a direct detection of a dark matter particle, so there was definitely a bang, but we haven’t found the smoking gun yet.
As a dark matter specialist, Dr. Kavanagh must be very used to this rather distinct absence of data. I asked him which dark matter detection method he thinks could prove successful if the scale of data collection were higher.
[There is one very cool dark matter detection method, called directional dark matter detection, where, after the detector is triggered, you try to track the path of the particle that hit it so that you can trace back where the dark matter particle came from. If you could do this for enough particles, it would be an incredible smoking gun, because you would see that all the particles are moving along the direction of the Earth’s motion through the galaxy. The problem is, to do this you need a detector that is not very dense, because you need to be able to track the particles. But if your detector is not very dense, then you expect the number of detections to be small, because there is less for the particles to collide with. In that case you need the detector to be huge in order to get enough data. But a detector that big presents you with a lot of technological challenges for data collection, which require a lot of funding to overcome.]
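To give a rough feel for why direction is such a powerful signature, here is a minimal toy sketch of my own (not Dr. Kavanagh’s code, and the velocity numbers are standard ballpark values rather than anything from the interview). It samples dark-matter velocities from a simple isotropic Maxwellian in the galactic rest frame, boosts them into the Earth’s frame, and shows that the arrival directions pile up around the direction the “dark matter wind” blows towards:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Ballpark assumptions for illustration (km/s), not values from the interview:
SIGMA_V = 156.0   # 1D velocity dispersion of dark matter in the galactic frame
V_EARTH = 232.0   # speed of the Earth through the galaxy, taken along +z

n = 100_000

# Isotropic Maxwellian velocities in the galactic rest frame.
v_gal = rng.normal(0.0, SIGMA_V, size=(n, 3))

# Boost into the Earth's frame: the Earth moves along +z, so on average the
# dark matter appears to stream towards -z (the "dark matter wind").
v_lab = v_gal - np.array([0.0, 0.0, V_EARTH])

# Angle between each particle's direction of motion and the average wind
# direction (-z). An isotropic flux would give a mean of ~0 and a 50/50
# split between hemispheres; the boosted flux is strongly forward-peaked.
speeds = np.linalg.norm(v_lab, axis=1)
cos_theta = -v_lab[:, 2] / speeds

print(f"mean cos(theta): {cos_theta.mean():.2f}")
print(f"fraction in forward hemisphere: {(cos_theta > 0).mean():.2f}")
```

An isotropic background would split evenly between the two hemispheres, which is why even a modest number of well-reconstructed tracks pointing back along the Earth’s motion would be so convincing.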
Theory often reveals entirely new ways of interpreting established experimental conclusions, but I asked Dr. Kavanagh if he can think of an example where the scale of the astrophysical data has revealed something new about the theory.
[When you observe galaxy rotation curves: back in the 80s it would take a person a year or so to go to a telescope, measure the rotation of stars and gas in galaxies and trace out the rotation curve. But we now live in an age when a space telescope does this automatically and sends the results to physicists, who clean the data, make sure it makes sense and process it into rotation curves. What you could not have told from the old data, simply because there couldn’t have been enough of it, is that if you line up [all the curves from each galaxy], they match in a very strange way, with very little scatter between them. It seems that this pattern between galaxy rotation curves is some sort of universal law, and it is taken to be evidence for dark matter. There are a number of other explanations for it, but the point is that if you took just one galaxy and drew its rotation curve, you would see nothing. It’s only if you do it for 200 galaxies with over 3000 data points, each with its own data processing pipeline, that you see this correlation.]
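To make the stacking idea concrete, here is a minimal sketch of the kind of exercise he describes, written against hypothetical per-galaxy CSV files (the directory name, column layout and crude normalisation are my own placeholders; real analyses use carefully curated rotation-curve samples and much more sophisticated modelling):

```python
import glob

import numpy as np

# Hypothetical input: one CSV per galaxy with two columns, radius_kpc and
# v_circ_kms, sorted by radius. Directory name and columns are placeholders.
files = sorted(glob.glob("rotation_curves/*.csv"))

# Common (normalised) radius grid so galaxies of different sizes line up.
r_grid = np.linspace(0.1, 1.0, 30)

curves = []
for path in files:
    r, v = np.loadtxt(path, delimiter=",", unpack=True)
    # Crude normalisation: scale each galaxy by its own outermost radius
    # and peak velocity before interpolating onto the common grid.
    curves.append(np.interp(r_grid, r / r.max(), v / v.max()))

curves = np.array(curves)

# The interesting quantity is how tightly the normalised curves cluster
# around the median once hundreds of galaxies are stacked.
median_curve = np.median(curves, axis=0)
scatter = np.std(curves, axis=0)
for r, m, s in zip(r_grid, median_curve, scatter):
    print(f"r/R_max = {r:.2f}: median v/v_max = {m:.2f} +/- {s:.2f}")
```

The details of the normalisation are not the point; the point is that the scatter about the median curve only becomes meaningful once hundreds of independently processed galaxies are put on a common footing.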
Big data’s true meaning or usefulness is often only revealed when it is presented or stored in the correct manner. I asked Dr. Kavanagh what the standards are for presenting large amounts of data in astroparticle physics (graphically, in data storage formats, in pattern analysis), and whether he thinks they could be improved by greater standardization, i.e. less stylistic choice by the paper authors.
[I would say it’s starting to become standardized. People have started to use specific high-density file formats, for example the FITS file format in astronomy. It is specifically designed to contain all the metadata you would need for astronomical measurements. This is necessary because you’re going to be imaging tons and tons of data from thousands of stars, so you need to have a way to optimise the process.
In astroparticle physics I think we haven’t really gotten that far yet; I think it’s because each experiment is so different. The FITS file came out of a bunch of pixels being your data, so there is a sensible way for [researchers] to agree on what their data is going to look like. In astroparticle physics your data is sometimes voltages, for example, and sometimes it’s a list of events, so it’s very hard to standardise.]
Do you think this is because astroparticle physics is still quite a young field? Or do you think this is just intrinsic?
[I think a young field would come up with a standard if there were an obvious one. Astroparticle physics being a young field with lots of different players doing different things makes it tricky.]
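For readers who have not met FITS before, here is a short illustration of what “metadata travelling with the data” looks like in practice, using the widely used astropy library (the filename is a placeholder and the keywords shown are common but instrument-dependent; this is a generic sketch, not tied to any particular survey):

```python
from astropy.io import fits

# Open a FITS file (placeholder filename). A FITS file is a list of
# "header-data units" (HDUs): each carries a metadata header plus data.
with fits.open("example_image.fits") as hdul:
    hdul.info()                      # summary of every HDU in the file

    header = hdul[0].header          # metadata: telescope, filter, exposure, WCS...
    data = hdul[0].data              # the pixel array (a numpy array), or None

    # A few metadata keywords commonly written by instruments; which ones
    # are actually present depends on whatever produced the file.
    for key in ("TELESCOP", "INSTRUME", "DATE-OBS", "EXPTIME"):
        print(key, "=", header.get(key, "not present"))

    if data is not None:
        print("image shape:", data.shape)
```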
Finally, to end on a light note, I asked Dr. Kavanagh: if he could get his hands on any piece of data he wanted just by snapping his fingers, which would it be?
[My head is telling me that what I want is either stars in the Milky Way or galaxies everywhere in the universe. So the positions and velocities of every star in the Milky Way, and this is funny because the Gaia satellite, which I haven’t even talked about, gives you almost that. From that you could reconstruct the mass distribution in the Milky Way, which would be an incredibly cool thing to do.
The other is the correlations of galaxies over the whole universe. I don’t even know what I would do with that, but it was the first thing my brain went to! Someone would want to use that.
I think this is what astroparticle physics is sometimes: taking data that’s already there and thinking of strange things that you could do with it. So if I had a hard drive with the distribution of all the galaxies in the universe, I think I could think of something interesting to do with it.]
I mean, I can give you that, I just have to put quite large error bars on it!
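As a closing aside for readers who want to play with almost exactly the data Dr. Kavanagh wished for: the Gaia catalogue is publicly queryable. The sketch below uses the astroquery package to pull positions, parallaxes and proper motions for a small sample of stars; the particular query and cuts are my own illustrative choices, assuming the Gaia DR3 main source table is the one of interest:

```python
from astroquery.gaia import Gaia

# ADQL query against the Gaia archive: positions, parallaxes, proper motions
# and radial velocities for a small, well-measured sample of stars.
# The cuts and the TOP 100 limit are arbitrary choices for illustration.
query = """
SELECT TOP 100
       source_id, ra, dec, parallax, pmra, pmdec, radial_velocity
FROM gaiadr3.gaia_source
WHERE parallax_over_error > 10
  AND radial_velocity IS NOT NULL
"""

job = Gaia.launch_job(query)   # synchronous job, fine for small queries
stars = job.get_results()      # returns an astropy Table

print(stars.colnames)
print(stars[:5])
```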