Covering index for $group/$sum in MongoDB aggregation (with hint)
MongoDB indexes enable fast equality and range filtering (with $match) and ordering (with $sort). For aggregation operations like $group with the $first or $last accumulators, indexes can dramatically improve performance through the DISTINCT_SCAN optimization. However, for hash-based aggregations such as $group with $sum, or $sortByCount, there is no such optimization. Indexes can still provide significant performance benefits by enabling covered plans that avoid fetching the documents, but the query planner won’t select them automatically: you must specify the index explicitly with a hint.
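As a quick illustration of this contrast, here is a minimal sketch on a hypothetical orders collection (not the test collection used below) with an index on category: the first pipeline is eligible for DISTINCT_SCAN, while the second is hash-based and only benefits from the index as a hinted covering index:
db.orders.createIndex({ category: 1 });
// $group with $first after a $sort on the grouping key:
// the planner can turn this into a DISTINCT_SCAN on the index.
db.orders.aggregate([
  { $sort: { category: 1 } },
  { $group: { _id: "$category", firstOrder: { $first: "$_id" } } }
]);
// $group with $sum is hash-based: the index only helps as a covering
// index, and it has to be forced with a hint (demonstrated below).
db.orders.aggregate(
  [ { $group: { _id: "$category", n: { $sum: 1 } } } ],
  { hint: { category: 1 } }
);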
Setup
I created a test collection with 50,000 documents. Each document has a “groupme” field with 1000 distinct values, and a filler field (a random 50KB base64 string).
base64 -w 51200 /dev/urandom |
awk 'NR>50000{exit}{print int(1000*rand())"\t"$0}' |
mongoimport -c demo -f groupme,filler --type=tsv /dev/stdin --drop
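If those shell utilities are not available, a rough equivalent can be generated directly from mongosh. This is only a sketch: the filler is padded random text rather than base64, which makes no difference for this test since only its size matters:
// Sketch: 100 batches of 500 documents, each with a "groupme" value
// in 0-999 and a ~50KB "filler" string (batched to stay under message size limits).
for (let batch = 0; batch < 100; batch++) {
  db.demo.insertMany(Array.from({ length: 500 }, () => ({
    groupme: Math.floor(1000 * Math.random()),
    filler: Math.random().toString(36).slice(2).padEnd(51200, "x")
  })));
}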
I continue in mongosh: after checking the document count, I use an aggregation pipeline to group by “groupme” and count the documents in each group:
db.demo.countDocuments()
50000
db.demo.aggregate([ { $sortByCount: "$groupme" } ])
[
{ _id: 937, count: 78 }, { _id: 517, count: 71 },
{ _id: 798, count: 70 }, { _id: 182, count: 68 },
{ _id: 812, count: 68 }, { _id: 158, count: 67 },
{ _id: 60, count: 67 }, { _id: 157, count: 66 },
{ _id: 653, count: 66 }, { _id: 901, count: 66 },
{ _id: 450, count: 66 }, { _id: 587, count: 66 },
{ _id: 701, count: 66 }, { _id: 403, count: 66 },
{ _id: 110, count: 65 }, { _id: 757, count: 65 },
{ _id: 461, count: 65 }, { _id: 593, count: 65 },
{ _id: 418, count: 65 }, { _id: 39, count: 64 }
]
Type "it" for more
Collection Scan
I executed it again to get the execution plan with execution statistics:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
]).explain("executionStats")
In my small lab, the aggregation ran for about 590 milliseconds:
print(x.stages[0].executionTimeMillisEstimate)
Long('590')
The execution plan shows a collection scan.
print(x.stages[0].$cursor.queryPlanner.winningPlan.queryPlan)
{
stage: 'GROUP',
planNodeId: 3,
inputStage: {
stage: 'COLLSCAN',
planNodeId: 1,
filter: {},
direction: 'forward'
}
}
According to the execution statistics, most of the elapsed time is spent in the scan operation, which took 549 milliseconds, with the remainder in the projection and grouping steps:
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 579,
opens: 1,
closes: 1,
saveState: 37,
restoreState: 36,
isEOF: 1,
groupBySlots: [ Long('5') ],
expressions: { '6': 'count() ', initExprs: { '6': null } },
mergingExprs: { '4': 'sum(s4) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 569,
opens: 1,
closes: 1,
saveState: 37,
restoreState: 36,
isEOF: 1,
projections: { '5': '(s3 ?: null) ' },
inputStage: {
stage: 'scan',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 549,
opens: 1,
closes: 1,
saveState: 37,
restoreState: 36,
isEOF: 1,
numReads: 50000,
recordSlot: 1,
recordIdSlot: 2,
scanFieldNames: [ 'groupme' ],
scanFieldSlots: [ Long('3') ]
}
}
}
Index on the grouping key
To demonstrate the benefit of a covering index, I create an index on the grouping key:
db.demo.createIndex(
{ groupme: 1 }
);
The query planner doesn’t use the index and the query still uses a collection scan:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
]).explain("executionStats")
print(x.stages[0].$cursor.queryPlanner.winningPlan.queryPlan)
{
stage: 'GROUP',
planNodeId: 3,
inputStage: {
stage: 'COLLSCAN',
planNodeId: 1,
filter: {},
direction: 'forward'
}
}
MongoDB’s query planner doesn’t automatically choose covering indexes, but I can force it with a hint:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1 } } ).explain("executionStats")
print(x.stages[0].$cursor.queryPlanner.winningPlan.queryPlan)
{
stage: 'GROUP',
planNodeId: 3,
inputStage: {
stage: 'PROJECTION_COVERED',
planNodeId: 2,
transformBy: { groupme: true, _id: false },
inputStage: {
stage: 'IXSCAN',
planNodeId: 1,
keyPattern: { groupme: 1 },
indexName: 'groupme_1',
isMultiKey: false,
multiKeyPaths: { groupme: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { groupme: [ '[MinKey, MaxKey]' ] }
}
}
}
The execution plan uses an index scan with a projection that covers the grouping field. Because the query reads only the small index entries, not the large documents, execution time drops to about 20 milliseconds:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('22')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 20,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
groupBySlots: [ Long('4') ],
expressions: { '5': 'count() ', initExprs: { '5': null } },
mergingExprs: { '3': 'sum(s3) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 10,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
projections: { '4': '(s2 ?: null) ' },
inputStage: {
stage: 'ixseek',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 10,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
indexName: 'groupme_1',
keysExamined: 50000,
seeks: 1,
numReads: 50001,
recordIdSlot: 1,
outputSlots: [ Long('2') ],
indexKeysToInclude: '00000000000000000000000000000001',
seekKeyLow: 'KS(0A01) ',
seekKeyHigh: 'KS(F0FE) '
}
}
}
While some databases switch to a different aggregation algorithm when the input data is ordered, MongoDB relies on a hash-based algorithm here. I test two indexes to determine whether one whose prefix matches the grouping key is more efficient than one that merely covers it.
Index not starting with the grouping key
I create an index that does not start with the grouping key but still covers it:
db.demo.createIndex(
{ _id:1, groupme: 1 }
);
The index is slightly larger, and the scan takes a few additional milliseconds:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { _id:1, groupme: 1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('24')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 19,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
groupBySlots: [ Long('4') ],
expressions: { '5': 'count() ', initExprs: { '5': null } },
mergingExprs: { '3': 'sum(s3) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 19,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
projections: { '4': '(s2 ?: null) ' },
inputStage: {
stage: 'ixseek',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 19,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
indexName: '_id_1_groupme_1',
keysExamined: 50000,
seeks: 1,
numReads: 50001,
recordIdSlot: 1,
outputSlots: [ Long('2') ],
indexKeysToInclude: '00000000000000000000000000000010',
seekKeyLow: 'KS(0A0A01) ',
seekKeyHigh: 'KS(F0F0FE) '
}
}
}
For comparison, I create another index with the same fields in the key, arranged in a different order so that it has the same size but starts with the grouping key:
db.demo.createIndex(
{ groupme: 1, _id:1 }
);
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1, _id:1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('22')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 20,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
groupBySlots: [ Long('4') ],
expressions: { '5': 'count() ', initExprs: { '5': null } },
mergingExprs: { '3': 'sum(s3) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 20,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
projections: { '4': '(s2 ?: null) ' },
inputStage: {
stage: 'ixseek',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 0,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
indexName: 'groupme_1__id_1',
keysExamined: 50000,
seeks: 1,
numReads: 50001,
recordIdSlot: 1,
outputSlots: [ Long('2') ],
indexKeysToInclude: '00000000000000000000000000000001',
seekKeyLow: 'KS(0A0A01) ',
seekKeyHigh: 'KS(F0F0FE) '
}
}
}
The execution time remains similar: the main benefit of the index comes from covering the fields used in the $group stage, regardless of their order.
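The same covering requirement applies when the $group stage sums a field instead of counting documents. As a sketch (the amount field below is hypothetical and not part of my test collection), the index must then include both the grouping key and the summed field to keep the plan covered:
// Hypothetical: sum an "amount" field per "groupme" value.
// The index covers both fields, so the hinted plan reads only index entries.
db.demo.createIndex({ groupme: 1, amount: 1 });
db.demo.aggregate(
  [ { $group: { _id: "$groupme", total: { $sum: "$amount" } } } ],
  { hint: { groupme: 1, amount: 1 } }
);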
Forcing disk spill
To compare the work done with the two indexes, I force disk spills by setting the hash aggregation memory limit to a value smaller than what the aggregation needs:
db.adminCommand({
setParameter: 1,
internalQuerySlotBasedExecutionHashAggApproxMemoryUseInBytesBeforeSpill: 100
}).was
Long('104857600')
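The previous value is returned in the was field. To check the current value at any time, the matching getParameter command can be used:
db.adminCommand({
  getParameter: 1,
  internalQuerySlotBasedExecutionHashAggApproxMemoryUseInBytesBeforeSpill: 1
})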
Both runs must spill to disk while grouping the 50,000 entries, but the number of spills differs significantly between the two indexes:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1, _id:1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('59')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage.spills)
996
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { _id:1, groupme: 1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('617')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage.spills)
16667
When the aggregation needs to spill to disk due to many distinct values for the grouping key or reduced hash aggregation memory, the index starting with the grouping key offers an advantage.
I reset the configuration to the default of 100MB:
db.adminCommand({
setParameter: 1,
internalQuerySlotBasedExecutionHashAggApproxMemoryUseInBytesBeforeSpill: 104857600
}).was
Long('1')
Conclusion
For aggregations, especially when documents are large, using a covering index can greatly improve performance by allowing MongoDB to read compact index entries instead of full documents. Because MongoDB’s query planner does not automatically select covering indexes for these workloads, you typically need to use a hint to force their use, even when there is no filter or sort. The most important factor is ensuring the index covers the fields used by the $group stage, as this eliminates unnecessary document fetches. When enough memory is available, the order of fields in the index has little impact. However, under memory pressure that causes hash aggregation to spill to disk, particularly with a high number of groups, an index that begins with the grouping field can provide additional performance benefits by processing groups in order and reducing repeated spills.
I used $sortByCount to demonstrate this point, but the principle also applies when using $group with the $sum accumulator, since $sortByCount internally uses {$sum: 1}. For the $first or $last accumulators, MongoDB can take advantage of the index automatically, without a hint, because it only needs to retrieve the first or last entry from the index for each group (the DISTINCT_SCAN optimization mentioned earlier).
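For reference, the hinted pipeline used throughout this post can be written with an explicit $group and $sort, which is what $sortByCount expands to:
// Equivalent to { $sortByCount: "$groupme" }: group with { $sum: 1 },
// then sort by the count in descending order. The same hint applies.
db.demo.aggregate(
  [
    { $group: { _id: "$groupme", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
  ],
  { hint: { groupme: 1 } }
);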