If you have recently received an email from human resources announcing that you are expected to publish three papers over the next year in journals with an impact factor of at least 20, there is one crumb of comfort. You will at least be able to enter the misguided missive for a new, annual “bad metrics” prize, modelled on the Literary Review’s Bad Sex in Fiction Award for cringeworthy descriptions of hanky-panky.
The award, which will go to “the most egregious example of an inappropriate use of quantitative indicators in research management”, is the last of a series of recommendations on the use of metrics in research assessment arising from a major new review, whose report is published today. James Wilsdon, chair of the steering group for the review, admits that the idea is a “bit silly”, but stresses that it illustrates the serious point at the heart of the review’s conclusions: the need for more “responsible metrics”.
The report notes that “the metric tide”, after which it is named, is being whipped up by “powerful currents” arising from, inter alia, “growing pressures for audit and evaluation of public spending on higher education and research; demands by policymakers for more strategic intelligence on research quality and impact; [and] competition within and between institutions for prestige, students, staff and resources”.
Metrics – numbers – give at least the impression of objectivity, and they have become increasingly important in the management and assessment of research ever since citation databases such as the Science Citation Index, Scopus and Google Scholar became available online in the early 2000s. Metrics are particularly popular in political circles. The government commissions a report on UK research strength every couple of years from Elsevier, owner of Scopus, and its conclusion – that with just 3.2 per cent of global research spending and 4.1 per cent of the world’s researchers, the UK receives 11.6 per cent of all citations worldwide and produces 15.9 per cent of the most highly cited articles – is frequently trotted out as proof that the country punches above its weight.
Within universities, too, metrics have been widely adopted, not merely for institutional benchmarking but also, increasingly, for managing the performance of academics. A recent study by Times Higher Education suggested that individual metrics-based targets of one form or another have been implemented at about one in six UK universities.
The report attributes this state of affairs to the increasing pressure on universities to be “more accountable to government and public funders of research”, and also to the financial pressures imposed on institutions by constrained funding and globalisation.
“Within this culture shift, metrics are positioned as tools that can drive organisational financial performance as a key part of an institution’s competitiveness,” the report notes.
It adds that metrics have “helped to make decision-making fairer and more transparent, and allowed institutions to tackle genuine cases of underperformance”. However, “any moves towards greater quantification of performance management are a cause for concern among some academics, who fear it will erode the traditional values of universities”.
A legitimate, recurrent concern, it says, is that research managers can become “over-reliant on indicators that are widely felt to be problematic or not properly understood…or on indicators that may be used insensitively or inappropriately”, and do not “fully recognise the diverse contributions of individual researchers to the overall institutional mission or the wider public good”.
Another concern is that metrics distort scientific priorities, especially among early career researchers. Wilsdon, who is professor of science and democracy at the University of Sussex, fears that young researchers are pushed to “publish certain sorts of things [only] in certain sorts of places” in pursuit of “the right numbers rather than the right questions. That is a tragedy if we want the best, brightest people to pursue the things that really matter.”
The report also flags up concerns that the use of bibliometrics unfairly disadvantages women (evidence suggests that men are reluctant to cite women) and interdisciplinary research (which tends to be cited less often than papers in the mainstream of disciplines).
The aforementioned use of metrics as targets for individual academics to achieve is another major concern of commentators. Critics fear that this increases incentives to cut corners or to cheat outright, undermining the integrity of research literature. There are concerns that such targets can even undermine the mental health of those struggling to meet the goals. The latter argument was voiced particularly vociferously at the end of last year when Stefan Grimm, a professor of toxicology at Imperial College London, committed suicide after being told that he was failing to bring in the level of grant income expected of an Imperial professor.
According to Wilsdon, grant income targets “at least have the advantage of relative simplicity compared to more algorithmically complex bibliometrics or altmetrics”. But “if they are applied insensitively or elevated above everything else that makes up a rounded research and teaching portfolio, they can be harmful. They also need to be applied in a contextual way; if you demand that all researchers increase grant income at a time when public spending on research is flatlining or falling, you are setting goals that many simply won’t be able to meet.”
A particular bugbear of metrics critics is journal impact factor. This is a measure of the average number of citations received by papers in a particular journal over a particular period (usually two years). It is often pointed out that this average masks a very wide variation among the citations received by individual papers, with a few garnering very high numbers and many never garnering any. Nevertheless, journal impact factor is often used by hiring, promotion and grant review committees as a proxy for the quality of a particular paper or author.
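The gap between a journal-level average and the fate of a typical paper can be seen with a small calculation. The sketch below is illustrative only: the citation counts are invented, the function name `impact_factor_like` is hypothetical, and the calculation is a simplified stand-in for how any database actually computes a published impact factor.

```python
from statistics import mean, median

# Hypothetical citation counts for ten papers published in one journal,
# counted over a two-year window. Invented for illustration; not from the report.
citations = [0, 0, 0, 1, 1, 2, 3, 5, 8, 180]


def impact_factor_like(counts):
    """Average citations per paper, the kind of figure a journal-level average reports."""
    return mean(counts)


journal_average = impact_factor_like(citations)  # pulled up by one heavily cited paper
typical_paper = median(citations)                # what a middle-of-the-pack paper received
uncited_share = sum(1 for c in citations if c == 0) / len(citations)

print(f"journal average: {journal_average}")
print(f"median paper:    {typical_paper}")
print(f"share uncited:   {uncited_share:.0%}")
```

In this made-up case a single paper accounts for most of the citations, so the journal average says little about any individual article, which is the critics' point about using it as a proxy for the quality of a paper or author.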
For their part, journals argue that this is not their fault, but some observers blame them for talking up their impact factors and doing everything they can to keep them high, such as constraining the number of papers they publish.
Sir Philip Campbell, editor-in-chief of Nature and a member of the review’s steering group, says that scientists “value highly selective journals, so a massive expansion of Nature, decreasing its selectivity, in order to take the heat out of the abuse of metrics would be perverse”.
He insists that Nature selects papers on scientific merit alone and has repeatedly warned in editorials against misusing journal impact factor. But it is “one measure of a journal’s significance, and there is no reason not to publish its value when it is released, and to highlight any successes”, he adds.
Nevertheless, the review’s report calls on publishers to “reduce emphasis on the journal impact factor as a promotional tool” and to provide more article-level metrics and information on individual contributions to papers “to encourage a shift towards assessment based on the academic quality of an article”.
Another metrics-related concern highlighted in the report is universities’ obsession with their place in university league tables, whose claims to measure institutions against global standards are undermined by their “varying degrees of arbitrariness in the weighting of different components”. Wilsdon is particularly scathing of universities that adopt as a strategic aim the goal of reaching a certain position in rankings, which he sees as a dereliction of universities’ duty to come up with their own definitions of what constitutes quality.
For him, this is the primary illustration of the danger that a metric can become an end in itself: “People reach for things they can measure…rather than taking a step back and asking: ‘What kind of institution do we want to be? What kind of research profile and impact do we want to have, and what are the best targets and indicators to help us move in that direction?’
“It is pretty tragic to see some of our greatest universities abandoning their pretence of having a proper strategic plan and objectives and essentially handing that job over to rankers,” he adds. “You even hear of individual vice-chancellors having huge chunks of their pay packet pegged to performance in a league table. That is also an absolute abrogation of responsibility by governing councils, and whoever has signed off on that should be ashamed of themselves.”

Such abuses, and the opposition they whip up, may go some way towards explaining why, despite the trend towards management-by-numbers in the public sector, the research excellence framework – whose shape is largely fashioned by consensus among academics – is still a largely metrics-free zone. Nevertheless, concerns abound about the costs of such a huge peer review-driven exercise; a recent report by Rand Europe on the 2014 REF put the cost of the impact element alone at £55 million – and the total cost, yet to be announced, is likely to be well in excess of £100 million.
An attempt in 2006 by Gordon Brown, who was then chancellor, to turn the research assessment exercise into a metrics-driven exercise was eventually abandoned after a study by the Higher Education Funding Council for England concluded that citation information was “insufficiently robust to be used formulaically or as a primary indicator of quality”, the report notes. But REF panels were permitted to use metrics to inform their judgements, and the debate never quite went away.
Nor is it a debate that straightforwardly sets government officials against academics. It was reinvigorated in 2013 when Dorothy Bishop, professor of developmental neuropsychology at the University of Oxford, showed that a psychology department’s h-index – a hybrid measure of the volume and citation performance of its papers – during the assessment period for the 2008 RAE was a good predictor of how much quality-related funding it subsequently received.
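For readers unfamiliar with the measure, the h-index is conventionally defined as the largest number h such that h papers have each received at least h citations. The sketch below applies that standard definition to a list of citation counts; it is an assumption that a departmental h-index is computed in the same way over all of a department's papers, and the function name and sample figures are hypothetical.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break     # counts only fall from here, so h cannot grow further
    return h


# Invented citation counts for a small set of papers (illustration only).
department_papers = [25, 8, 5, 3, 3, 1, 0]
print(h_index(department_papers))
```

Because the measure rewards having many papers that are each reasonably well cited, one blockbuster paper (the 25 above) moves it no more than a paper with just enough citations to clear the threshold.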
Meanwhile, the years after 2008 saw the rapid rise to prominence of “altmetrics”, which track all manner of non-standard supposed indicators of quality, such as mentions or ratings on blogs or social media sites; readership on publishers’ websites or digital libraries; citation in non-scholarly sources such as policy documents; and popularity in scientific social networks such as Mendeley, ResearchGate and Academia.edu. As The Metric Tide notes, prominent publishers including Elsevier, Springer and Nature Publishing Group have all added altmetrics to articles in their digital collections.
Altmetrics have been talked up, in particular, as a potential way to quantify impact. Although assessment of impact by case studies is deemed to have been a success in the REF, in 2010 one of the first acts of the incoming universities and science minister, David Willetts, was to delay the REF by a year in order to assure himself that the methodology was robust. Given those doubts, and the pressure on public expenditure, it is perhaps not surprising that in one of his last major acts before stepping down as minister last summer, Willetts commissioned a second review, The Metric Tide, to examine the extent to which the time was now ripe for the greater use of metrics in research assessment in general, and the REF in particular.
However, on the use of metrics in the REF, the view set out in The Metric Tide remains in essence negative. Despite its acknowledgement of the various flaws and biases of peer review, the review’s steering group endorses the “common theme” that emerged from the 153 responses to its call for evidence: that peer review remains the “gold standard” of research assessment and that metrics can, at best, supplement it. Hence, although the limited use of citation data in the 2014 REF (typically in marginal cases or where there was disagreement) was “relatively successful” and there is scope for it to be “enhanced”, the report argues that peer review should remain the primary mechanism for assessment in the next exercise, which is likely to be in 2020.
“Bibliometricians generally see citation rates as a proxy measure of academic impact or of impact on the relevant academic communities. But…quality needs to be seen as a multidimensional concept that cannot be captured by any one indicator [and which] may vary by field and mission,” the report notes. This is particularly true in relation to altmetrics, which are “highly specific to the types of impact concerned”, while the definition of impact adopted by the REF is, rightly, very wide. So adoption of specific altmetrics to assess impact would create “a danger that the concept of impact might narrow and become too specifically defined by the ready availability of indicators…potentially constraining the overall diversity of the UK’s research base”. Worse still, most altmetrics are easy to manipulate and, given that more than £1 billion of funding a year is directly dependent on the REF results, almost inevitably would be.
A major impediment to the greater use of standard bibliometrics in the REF is the still scant coverage of the humanities and some social sciences in the existing databases. The report points out, for instance, that about half of citations for monographs occur in publications other than the journals that dominate such databases. One solution to that problem – already adopted in Italy and Australia (see “Global currents: how other countries use metrics” box, below) – would be to assess science subjects on the basis of citations but keep peer review for humanities and social sciences. But, according to Steven Hill, head of research policy at Hefce and a member of the report’s steering group, the review was warned that such a “hybrid” approach could result in institutions and funders regarding the humanities as less worthwhile. That fear is echoed by Eleonora Belfiore, associate professor of cultural policy at the University of Warwick, who was also part of the steering group. She says that a “two-speed” solution would “push the arts and humanities further to the margin of the sector and of policy”, and “further disincentivise database providers from trying to be more inclusive of humanities publications”.
Besides, she adds, such “special pleading” would not be justified by the evidence because, beyond a few “particular challenges” for the humanities, she has been rather surprised to discover that the concerns around the use of metrics are largely common across all fields.
If peer review is agreed to be the gold standard of research assessment, then the extent to which metrics could replace peer review depends on the extent to which judgements generated by metrics mirror those arrived at through peer review. Bishop’s analysis was followed up last year with a paper by four physicists that made predictions about the ranking of departments in four fields in the REF – physics, chemistry, biology and sociology – on the basis of their h-indices. A follow-up analysis compared those predictions with REF results and revealed that although there were correlations between the h-indices and the final scores, they were not nearly robust enough to replace peer review. At the time, one of the authors, Ralph Kenna, professor of theoretical physics at Coventry University, said that universities would get more accurate predictions of their likely movement in the REF rankings “by tossing dice”.
On the other hand, at the same March conference at which Wilsdon pre-announced the review’s conclusions about the REF, Nick Fowler, managing director of research management at Elsevier, drew attention to a high correlation between the amount of QR funding awarded to institutions in 2015-16 and the number of highly cited articles they produced during the assessment period. This was particularly true among smaller institutions; for larger ones, there was a stronger correlation between REF scores and departments’ “field-weighted citation impact”, which accounts for differences in citation between papers of different types, scopes and ages.
“Metrics are a type of evidence…Used well, they can be very powerful,” Fowler argued. He repeated the view that metrics could play a greater role in the REF in at least some subjects following a more recent analysis of metrics and the REF carried out by Elsevier for THE. The review steering group asked Hefce to look deeper into the issue by examining correlations between a whole raft of metrics and the peer review scores for individual outputs. Although it found some reasonable correlations (see “Analyse this: numbers struggle to predict REF scores” box, below), it concluded, like Kenna, that these are not strong enough to justify any move to replace peer review with metrics.
For his part, Stephen Curry, professor of structural biology at Imperial and a member of the report’s steering group, dismisses the idea that peer review is a “gold standard”, and he is partly persuaded that, at a “high enough level of granularity”, such as entire departments, “the numbers do start to mean a bit more because you are accumulating some indicator of average behaviour”. Nevertheless, he still believes that research activity is “a complex business” that he cannot envisage ever being “reducible to [any] basket of metrics”. Besides which, he adds, carrying out the REF by peer review has its own value in that it demonstrates higher education’s willingness to put itself through “this rather gruelling assessment period”.
“The UK can wear that as a badge of honour [because] it demonstrates it is immensely self-critical,” he says.
Hill agrees that it is “hard to see how you would get sufficient accountability about the way in which public funding is being spent – rather than [merely] allocated – through an entirely metrics-driven approach”. Furthermore, “the kinds of metrics you might use to allocate funding at macro level might not provide the nuanced information institutions want about their research performance”. He is also sceptical of peer review’s status as a gold standard, but adds that this does not mean that metrics are superior.
“The idea we developed during the review is that different ways of measuring quality will inevitably come up with different answers…So the best you can do is get as rounded a view as you can by looking at a range of different [measurements]. This is at the heart of the responsible metrics argument: everything has to be seen in context, and the more information you have – as long as it is reliable – the better.”

According to the report, responsible metric use involves being transparent about the use of a range of robust metrics that are inclusive of all fields, while bearing in mind the potential wider effects of their use and “updating them in response”. Curry admits that this notion of responsibility is not a new one: it has already been pushed in recent declarations against the misuse of metrics, such as 2013’s San Francisco Declaration on Research Assessment and a further statement published in 2015.
The former, popularly known as Dora, “isn’t saying you can’t publish in journals with a high impact factor: it is saying that when you do assessment and hiring, you must look for a better way than just relying on the name of the journal – it is a fairly modest ask”. Despite this, he notes, very few UK universities have yet signed it. “That, to me, tells you an awful lot about the present culture,” Curry says. Universities see the use of journal impact factors as “the game we play, the one everybody understands so [the thinking goes] we would be fools not to because it would hurt our bottom line.”
Curry says that the steering group debated whether to call on universities to sign Dora but settled instead for suggesting that they might want to as part of a process of developing “a clear statement of principles on their approach to research management and assessment, including the role of quantitative indicators”, on which basis they should “carefully select quantitative indicators that are appropriate to their institutional context”.
Wilsdon admits that it might be “a bit idealistic” to hope that reflection will, in itself, end bad practice (in which academics themselves are sometimes complicit – by, for example, putting their h-index in their email signatures). But he points out that the report also contains a raft of practical, technical measures to improve the accuracy of metrics. These revolve around an insistence that all institutions, outputs and individuals submitted to the next REF should have unique numerical identifiers to avoid ambiguities and to increase interoperability between the various systems used to capture research data, such as Researchfish, Gateway to Research and universities’ internal databases.
“If you talk about administrative burden and only focus on the REF, you are missing a lot,” Wilsdon says. “The real issue is how do you stop [academics] having to enter the same information 16 times? That is much more important, in a sense, than what we do at a policy level in the REF.”
More interoperability could also help to avoid a danger of which Wilsdon is keenly conscious: that by recommending even a modest rise in the use of metrics in the next REF without a corresponding drop in the peer review load, the report risks increasing still further the bureaucratic burden and cost of the exercise – particularly if, as in 2014, institutions still feel the need to manually check their citation data.
Hefce is not obliged to accept the review’s recommendations, but according to Hill, there is little in it that the funding council disagrees with. Hefce, he continues, will “feed thinking from it” into its own consultation on the shape of the next REF, which will open in the autumn, with decisions in late spring 2016.
Curry admits to being a bit disappointed that no “grand new vision” crystallised out of the steering group’s deliberations and is conscious that the report might come across as mere “common sense”.
“But people don’t always see what the common-sense thing to do is, and I hope that message comes across loud and clear that responsibility [is required]. It is very much a human, complex activity we are looking at, and we have to do our best by it: use tools available to us as best we can but don’t let them run amok.”
Wilsdon also admits that the report risks being criticised for recommending an excessively “incremental” approach. “But most people will recognise we have avoided proposing radical change for the sake of it, because we don’t think the evidence supports it.”
He hopes that, at the very least, the report will help the academy to move “beyond the situation where managers are uncritically defending the utility of metrics…and critics are saying that academia is so ineffably precious it can’t possibly be captured in any metrics – ‘leave us alone to get on with what we are doing and please give us more money’.”
For Wilsdon, metrics are no more intrinsically good or bad than controversial technologies such as genetically modified crops.
“You can have a debate about the positive ways they could be used [set against] how they could cause problems and the inappropriate ways they could be used,” he says. “We need that degree of sophistication in relation to research management. If we are entering a world drenched in big data, the research system isn’t going to sit outside that. The question is how we use the power of real-time data collection and analysis of our own activity to shape the kind of research culture and system we want, rather than allowing crude, inappropriate uses to steer us off in directions we don’t want to go.”
Global currents: how other countries use metrics
Metrics play varying roles in the research assessment exercises carried out by other countries, the review learned.
The quality element of the six-yearly reviews conducted for New Zealand’s Performance-Based Research Fund – assessing individuals rather than departments – is based on peer review.
By contrast, the Danish Bibliometric Research Indicator (BFI), on the basis of which a quarter of public university funding is allocated, is based on points assigned according to the types and publication destinations of research outputs. Every year expert panels draw up a list of publishers, journals and book and conference series that they consider to be reputable, and split them into two quality levels.
Meanwhile, hybrid approaches are adopted in both Italy and Australia, with the humanities and social sciences assessed on the basis of peer review, and the natural sciences and engineering judged by metrics.
Italy’s Evaluation of the Quality of Research (VQR) involves journal impact factor and citation counting.
Australia’s Excellence in Research for Australia (on which no funding rides) involves a basket of measures, some of which – patents, plant breeders’ rights, registered designs and research commercialisation income – are intended to capture wider impact. But the UK metrics steering group heard there were concerns about the “implied narrow definition of societal impact and the potential that focusing on a small number of metrics might significantly skew behaviour”, as well as about the hybrid approach’s “potential to lead to perceived hierarchies which may cause significant tension between disciplinary groups”.
Analyse this: numbers struggle to predict REF scores
The metrics steering group asked the Higher Education Funding Council for England to dig into the fine detail of correlations between bibliometrics and peer review scores in the research excellence framework by examining the relationship between a suite of 15 metrics and the scores given to individual outputs across all units of assessment in 2014.
The analysis finds that metrics had a relatively poor ability to predict which papers were given a 4* rating, although correlations are higher in certain fields, such as clinical medicine, biological sciences, chemistry and economics.
The metric with the highest overall correlation was SCImago Journal Rank – a citations-based measure of the importance of journals. However, the correlation was only 0.34, where 1 is a perfect correlation. The most useful metrics for predicting REF scores across a wide range of fields were SCImago Journal Rank and Google Scholar citations.
Where metrics ratings were the same, papers by early career researchers under main panel A (biological sciences) were more likely to be deemed 4*, while those authored by early career researchers under main panel C (social sciences) were less likely to be deemed 4* than those written by more senior academics, especially in economics and social policy.
Female authors were less likely than men to have 4* papers, especially in main panels B (physical science) and C. However, female authorship did not correlate significantly with metrics scores.
POSTSCRIPT:
Article originally published as: The weight of numbers (9 July 2015)








