Meso-computing and meso-data: the forgotten middle

Posted on Fri 28 August 2020 by Matt Williams in rse

I've been working as a Research Software Engineer (RSE) at universities in the UK for many years now, across a variety of fields of research (particle physics, synthetic biology, cardiology, etc.), both as a user of computing services and as a provider of them. One thing I've seen cropping up repeatedly is a tendency to chase the latest fad. Now, in academia the latest fad is often up to ten years behind the state of the art, so, for example, it's only in the last few years that we've seen "deep learning" cropping up in every grant application. One of the things I do in my job is try to educate researchers about these technologies so that they know what they're actually asking for, and to ensure that if researchers talk about doing "big data" because they have "megabytes of data", they won't be embarrassed by a grant reviewer who knows their stuff.

The push in this direction is completely understandable. The grant application business is extremely competitive and anything you can put in your proposal to catch the eye of a reviewer is going to count in your favour.

This is all particularly evident when talking about data. Research money goes to the problems which are most worthy, and there's an assumption that the harder the problem, the more worthy it is. Furthermore, there's an assumption that the more data one has, the bigger and more difficult the research problem. This naturally pulls together into the idea that if your problem is a "big data" problem then it's more likely to get funded. You therefore see grant applications trying to convince reviewers that they're doing "big data" when really they're dealing with linear or spatial datasets of maybe only gigabytes to a terabyte. That's certainly a lot of data, but it often lacks the complexity that genuinely requires big data solutions.

Now, I'm not trying to down-play these areas of research. Instead, I'm arguing that the problems being solved here are just as worthy of investigation, even without couching them in terms of "big data". Most research problems I see discussed at universities are not big data, and they are all interesting problems. I think it would be healthy to reduce the allure of big data and let people know that it's ok not to fall into that category. Part of the problem is that there's no good name for this scale of data problem: that which is a little too big to realistically handle on a single laptop or desktop, but well below the size or complexity that requires a big data machine or a Hadoop cluster.

I've toyed with many names when trying to teach how to tackle problems of this size: Large Data, Biggish Data, Medium Data. None of them has ever grabbed me, so I've decided to coin a new term: meso-data. Meso here means "middle" or "intermediate", cf. Mesopotamia ("between rivers").

Meso-computing

Along with the problem in the data domain (mostly driven by buzzword-chasing), there's a similar issue in the computing-power domain. Most research follows a common path: it starts as a small investigation on a researcher's laptop until there are too many simulations to run, or they take longer than the working day and so can't be finished in time. At this point most research institutes will encourage the use of whatever central computing resources they have, usually a single large HPC cluster.

Research institutes such as universities face pressure to justify their expenditure on computing resources by extolling all the big problems they're solving: how many nanoseconds of molecular dynamics they're pushing through, or how fine-grained a meteorological grid they can simulate. This encourages the creation of systems which cater to those few groups in the university who are able to make really good use of a supercomputer — those who can run large multi-node MPI jobs with code optimised for the specific hardware, and who have experts in their team who understand high-performance computing.

The problem with this is that it further widens the gap in power and complexity between running on one's laptop and using a central facility. As with meso-data, there's a large number of researchers — I would argue most researchers — whose needs sit right in the middle. They're not doing supercomputing, they're doing meso-computing.

These researchers are best supported with small domain-specific batch clusters, with cloud computing (perhaps using Cluster in the Cloud), with software-as-a-service or with just some hands-on help from an RSE to get their code running more efficiently on their laptop.

Maybe Pandas is sufficient, or perhaps they need to use Dask. Maybe a course on concurrent.futures, to magically make their code finish in a quarter of the time, is the right solution. Regardless, the solution is probably not to rewrite the code in Fortran with MPI so it scales across 64 nodes, or to rent a Hadoop cluster.
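To make that concrete, here is a minimal, purely illustrative sketch of the concurrent.futures step. The simulate function and the parameter sweep are made up for this example; the pattern is simply to wrap an embarrassingly parallel loop in a ProcessPoolExecutor and let it use every core of the machine.

    from concurrent.futures import ProcessPoolExecutor

    def simulate(parameter):
        # Stand-in for one independent, CPU-heavy piece of work
        # (a hypothetical model run for a single parameter value).
        return parameter ** 2

    if __name__ == "__main__":
        parameters = range(1000)  # hypothetical parameter sweep

        # Serial version: results = [simulate(p) for p in parameters]
        # Parallel version: the same map, spread across the machine's cores.
        with ProcessPoolExecutor() as executor:
            results = list(executor.map(simulate, parameters))

        print(len(results), "simulations finished")

Dask covers the next step up in the same spirit: a Pandas-like interface over datasets that no longer fit in memory, without needing a cluster to get started.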

The middle

There will be those reading this who think I'm stating the obvious, who think "I've been working in this area for years, what's new?". That's kind of the point: a lot of researchers sit here, but the fact is that they are under-served. Most aren't computer experts and will use whichever tools are available, advertised to them and easy to use. This inevitably means that they email Excel spreadsheets to one another, perhaps alongside some R or Python scripts with hard-coded paths. These researchers are getting stuck as "Expert Beginners": people who know their entry-level tools so well that the short-term barrier to learning how to use them properly, or to using a better tool for the job, seems higher than it's worth.

They want to scale their research, but when they look around to see what the university can provide, they're told about getting access to the supercomputer, or that things would be better if they put their data into an Elasticsearch database. That jump is far too big, and we need to solve the social problem of allowing them to take only as many steps across the meso-data divide as they need to. We need well-explained and easy-to-use tools at every stage of the scaling process, not just at the top end.

These terms, meso-computing and meso-data, are deliberately humble. They are explicitly not about trying to be the biggest but rather about thoughtfully considering the problem at hand and choosing the right hammer. Unlike with big data, people shouldn't have to ask the question "is this a meso-data problem" because if they're asking the question, the answer is "yes". I want people to be comfortable saying in grant applications "since this is a meso-data challenge, we request funding for the skills and resources needed to tackle it" and ask for a full-time RSE without having to pretend that they're doing big data or need a dedicated supercomputer. Labels are helpful and I think that these labels apply well to a good chunk of the research community.

Meso-data comes with its own set of tools and solutions which are partially distinct from big data's. I haven't invented a whole new area of endeavour; many people have been working on solutions here for decades. But it's certainly not an area that attracts the excitement or research money that it should.

These are still hard problems, and in my experience the people working on them are solving real-world challenges or furthering our understanding of the universe. Meso-computing and meso-data projects still need expertise from RSEs or data scientists to make sure that the research is reliable, reproducible, tested and understandable.