MongoDB indexes allow fast equality and range filtering (with $match) and ordering (with $sort). For aggregation operations like $group with $first or $last accumulators, indexes can dramatically improve performance through the DISTINCT_SCAN optimization.
However, for hash-based aggregations such as $group with $sum or $sortByCount, there is no such optimization. While indexes can provide significant performance benefits by creating covered query plans (avoiding document fetches), the query planner will not automatically select them. You have to explicitly specify the index with a hint.
Set up a collection for the test
I created a test collection with 50,000 documents. Each document has a “groupme” field with 1,000 distinct values, and a filler field (a random 50KB base64 string).
base64 -w 51200 /dev/urandom |
awk 'NR>50000{exit}{print int(1000*rand())"\t"$0}' |
mongoimport -c demo -f groupme,filler --type=tsv /dev/stdin --drop
I continue in mongosh and use an aggregation pipeline to group by “groupme” and count the documents:
db.demo.countDocuments()
50000
db.demo.aggregate([ { $sortByCount: "$groupme" } ])
[
{ _id: 937, count: 78 }, { _id: 517, count: 71 },
{ _id: 798, count: 70 }, { _id: 182, count: 68 },
{ _id: 812, count: 68 }, { _id: 158, count: 67 },
{ _id: 60, count: 67 }, { _id: 157, count: 66 },
{ _id: 653, count: 66 }, { _id: 901, count: 66 },
{ _id: 450, count: 66 }, { _id: 587, count: 66 },
{ _id: 701, count: 66 }, { _id: 403, count: 66 },
{ _id: 110, count: 65 }, { _id: 757, count: 65 },
{ _id: 461, count: 65 }, { _id: 593, count: 65 },
{ _id: 418, count: 65 }, { _id: 39, count: 64 }
]
Type "it" for more
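As a reminder, $sortByCount is shorthand for a $group stage counting with {$sum: 1}, followed by a descending $sort on the count. Here is a toy sketch of that computation in plain JavaScript (my own illustration over a hypothetical sample array, not MongoDB internals):

```javascript
// Plain-JS equivalent of { $sortByCount: "$groupme" }:
// group by key, count with the equivalent of { $sum: 1 }, sort descending.
function sortByCount(docs, key) {
  const counts = new Map();
  for (const doc of docs) {
    const k = doc[key] ?? null;           // missing fields group under null
    counts.set(k, (counts.get(k) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([k, n]) => ({ _id: k, count: n }))
    .sort((a, b) => b.count - a.count);   // descending by count
}

// Hypothetical sample documents
const docs = [{ groupme: 1 }, { groupme: 2 }, { groupme: 1 }];
console.log(sortByCount(docs, "groupme"));
// [ { _id: 1, count: 2 }, { _id: 2, count: 1 } ]
```

Note that the hash table holds one accumulator per distinct key, which matters later when memory is constrained.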
Collection scan
I executed it again to get the execution plan with execution statistics:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
]).explain("executionStats")
In my small lab, the aggregation ran for about 590 milliseconds:
print(x.stages[0].executionTimeMillisEstimate)
Long('590')
The execution plan shows a collection scan:
print(x.stages[0].$cursor.queryPlanner.winningPlan.queryPlan)
{
stage: 'GROUP',
planNodeId: 3,
inputStage: {
stage: 'COLLSCAN',
planNodeId: 1,
filter: {},
direction: 'forward'
}
}
According to the execution statistics, the elapsed time is primarily caused by the scan operation, which took 549 milliseconds, and the aggregation step:
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 579,
opens: 1,
closes: 1,
saveState: 37,
restoreState: 36,
isEOF: 1,
groupBySlots: [ Long('5') ],
expressions: { '6': 'count() ', initExprs: { '6': null } },
mergingExprs: { '4': 'sum(s4) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 569,
opens: 1,
closes: 1,
saveState: 37,
restoreState: 36,
isEOF: 1,
projections: { '5': '(s3 ?: null) ' },
inputStage: {
stage: 'scan',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 549,
opens: 1,
closes: 1,
saveState: 37,
restoreState: 36,
isEOF: 1,
numReads: 50000,
recordSlot: 1,
recordIdSlot: 2,
scanFieldNames: [ 'groupme' ],
scanFieldSlots: [ Long('3') ]
}
}
}
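The collection scan has to read every full document, filler included, even though only the grouping field is needed. A rough back-of-envelope estimate of the data volume touched (assuming ~50KB of filler per document, as generated above, and a few dozen bytes per index entry, which is my assumption):

```javascript
// Rough estimate of the data volume touched by each plan.
// Assumptions: ~50KB filler per document, ~32 bytes per index entry.
const docs = 50_000;
const docSizeBytes = 50 * 1024;      // filler dominates document size
const indexEntryBytes = 32;          // assumed order of magnitude

const collscanBytes = docs * docSizeBytes;   // full documents
const ixscanBytes = docs * indexEntryBytes;  // index entries only

console.log((collscanBytes / 1024 ** 3).toFixed(1) + " GB vs " +
            (ixscanBytes / 1024 ** 2).toFixed(1) + " MB");
// "2.4 GB vs 1.5 MB"
```

Roughly three orders of magnitude less data to read, which explains why a covering index is worth forcing here.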
Index on the grouping key
To demonstrate the benefit of a covering index, I create an index on the grouping key:
db.demo.createIndex(
{ groupme: 1 }
);
The query planner does not use the index, and the query still uses a collection scan:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
]).explain("executionStats")
print(x.stages[0].$cursor.queryPlanner.winningPlan.queryPlan)
{
stage: 'GROUP',
planNodeId: 3,
inputStage: {
stage: 'COLLSCAN',
planNodeId: 1,
filter: {},
direction: 'forward'
}
}
MongoDB’s query planner does not automatically choose covering indexes, but I can force one with a hint:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1 } } ).explain("executionStats")
print(x.stages[0].$cursor.queryPlanner.winningPlan.queryPlan)
{
stage: 'GROUP',
planNodeId: 3,
inputStage: {
stage: 'PROJECTION_COVERED',
planNodeId: 2,
transformBy: { groupme: true, _id: false },
inputStage: {
stage: 'IXSCAN',
planNodeId: 1,
keyPattern: { groupme: 1 },
indexName: 'groupme_1',
isMultiKey: false,
multiKeyPaths: { groupme: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { groupme: [ '[MinKey, MaxKey]' ] }
}
}
}
The execution plan uses an index scan with a projection that covers the grouping field. Because the query accesses only the small index entries, not the large documents, the execution time dropped to around 20 milliseconds:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('22')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 20,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
groupBySlots: [ Long('4') ],
expressions: { '5': 'count() ', initExprs: { '5': null } },
mergingExprs: { '3': 'sum(s3) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 10,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
projections: { '4': '(s2 ?: null) ' },
inputStage: {
stage: 'ixseek',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 10,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
indexName: 'groupme_1',
keysExamined: 50000,
seeks: 1,
numReads: 50001,
recordIdSlot: 1,
outputSlots: [ Long('2') ],
indexKeysToInclude: '00000000000000000000000000000001',
seekKeyLow: 'KS(0A01) ',
seekKeyHigh: 'KS(F0FE) '
}
}
}
While some databases use different aggregation algorithms when the input data is ordered, MongoDB always relies on a hash-based algorithm. I test two indexes to determine whether one whose prefix matches the grouping key is more efficient.
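To make the distinction concrete, here is a toy sketch in plain JavaScript (my illustration, not MongoDB internals) of the two approaches: hash aggregation keeps one accumulator per distinct key in memory regardless of input order, while a streaming aggregation over input sorted by the grouping key can emit each group as soon as the key changes, holding only a single accumulator at a time:

```javascript
// Hash aggregation: one in-memory accumulator per distinct key,
// works on any input order; memory grows with the number of groups.
function hashCount(keys) {
  const acc = new Map();
  for (const k of keys) acc.set(k, (acc.get(k) ?? 0) + 1);
  return acc;
}

// Streaming aggregation: requires input sorted by the group key;
// emits a finished group whenever the key changes, so only one
// accumulator is live at any moment.
function streamCount(sortedKeys) {
  const out = [];
  let current, n = 0;
  for (const k of sortedKeys) {
    if (k !== current) {
      if (n > 0) out.push([current, n]); // previous group is complete
      current = k;
      n = 0;
    }
    n++;
  }
  if (n > 0) out.push([current, n]);     // flush the last group
  return out;
}

const keys = [3, 1, 2, 1, 3, 3];
console.log(hashCount(keys));            // Map(3) { 3 => 3, 1 => 2, 2 => 1 }
console.log(streamCount([...keys].sort()));
// [ [ 1, 2 ], [ 2, 1 ], [ 3, 3 ] ]
```

MongoDB's SBE group stage behaves like the first function even when an index delivers the keys in order, which is what the next experiments probe.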
Index not starting with the grouping key
I create an index that is not ordered on the grouping key but still covers it:
db.demo.createIndex(
{ _id:1, groupme: 1 }
);
The index is slightly larger, and the scan takes a few extra milliseconds:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { _id:1, groupme: 1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('24')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 19,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
groupBySlots: [ Long('4') ],
expressions: { '5': 'count() ', initExprs: { '5': null } },
mergingExprs: { '3': 'sum(s3) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 19,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
projections: { '4': '(s2 ?: null) ' },
inputStage: {
stage: 'ixseek',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 19,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
indexName: '_id_1_groupme_1',
keysExamined: 50000,
seeks: 1,
numReads: 50001,
recordIdSlot: 1,
outputSlots: [ Long('2') ],
indexKeysToInclude: '00000000000000000000000000000010',
seekKeyLow: 'KS(0A0A01) ',
seekKeyHigh: 'KS(F0F0FE) '
}
}
}
For comparison, I created another index with the same fields in the key, but arranged in a different order, to match its size while starting with the grouping key:
db.demo.createIndex(
{ groupme: 1, _id:1 }
);
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1, _id:1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('22')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage)
{
stage: 'group',
planNodeId: 3,
nReturned: 1000,
executionTimeMillisEstimate: 20,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
groupBySlots: [ Long('4') ],
expressions: { '5': 'count() ', initExprs: { '5': null } },
mergingExprs: { '3': 'sum(s3) ' },
usedDisk: false,
spills: 0,
spilledBytes: 0,
spilledRecords: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'project',
planNodeId: 3,
nReturned: 50000,
executionTimeMillisEstimate: 20,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
projections: { '4': '(s2 ?: null) ' },
inputStage: {
stage: 'ixseek',
planNodeId: 1,
nReturned: 50000,
executionTimeMillisEstimate: 0,
opens: 1,
closes: 1,
saveState: 8,
restoreState: 7,
isEOF: 1,
indexName: 'groupme_1__id_1',
keysExamined: 50000,
seeks: 1,
numReads: 50001,
recordIdSlot: 1,
outputSlots: [ Long('2') ],
indexKeysToInclude: '00000000000000000000000000000001',
seekKeyLow: 'KS(0A0A01) ',
seekKeyHigh: 'KS(F0F0FE) '
}
}
}
The execution time remains similar: the main benefit of the index is realized when it includes the fields used in the $group stage, regardless of their order.
Forcing disk spill
To compare the work done with the two indexes, I force a disk spill by setting the hash aggregation memory limit to a value smaller than what the aggregation needs:
db.adminCommand({
setParameter: 1,
internalQuerySlotBasedExecutionHashAggApproxMemoryUseInBytesBeforeSpill: 100
}).was
Long('104857600')
With this setting, both plans spill to disk while grouping the 50,000 entries, but not to the same extent:
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { groupme: 1, _id:1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('59')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage.spills)
996
x=db.demo.aggregate([
{ $sortByCount: "$groupme" }
],
{ hint: { _id:1, groupme: 1 } } ).explain("executionStats")
print(x.stages[0].executionTimeMillisEstimate)
Long('617')
print(x.stages[0].$cursor.executionStats.executionStages.inputStage.inputStage.spills)
16667
16667
When the aggregation must spill to disk, due to many distinct values for the grouping key or reduced hash aggregation memory, the index starting with the grouping key offers an advantage.
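A toy simulation can illustrate why sorted input spills less. This is my own model, not MongoDB's actual spill policy: I assume a hash table with room for a fixed number of distinct keys that is flushed entirely whenever a new key would overflow it. With input ordered by the grouping key, each flush writes mostly finished groups; with interleaved input, the same keys get flushed as partial counts over and over:

```javascript
// Toy model: hash-aggregate a stream of keys with room for only
// `limit` distinct keys; when a new key would exceed the limit,
// spill (flush) the whole table. Count spill events and records.
function aggregateWithSpills(keys, limit) {
  const table = new Map();
  let spills = 0, spilledRecords = 0;
  for (const k of keys) {
    if (!table.has(k) && table.size === limit) {
      spills++;
      spilledRecords += table.size;
      table.clear();
    }
    table.set(k, (table.get(k) ?? 0) + 1);
  }
  return { spills, spilledRecords };
}

// 10 groups of 100 entries each, with room for only 2 groups in memory.
const groups = 10, perGroup = 100, limit = 2;
const sorted = [];
for (let g = 0; g < groups; g++)
  for (let i = 0; i < perGroup; i++) sorted.push(g);
// Round-robin interleaving: worst case for the hash table.
const interleaved = [];
for (let i = 0; i < perGroup; i++)
  for (let g = 0; g < groups; g++) interleaved.push(g);

console.log(aggregateWithSpills(sorted, limit));      // few spills
console.log(aggregateWithSpills(interleaved, limit)); // two orders of magnitude more
```

The same 1,000 entries per run, but the interleaved order spills vastly more, which mirrors the 996 versus 16,667 spills measured above.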
I reset the configuration to the default of 100MB:
db.adminCommand({
setParameter: 1,
internalQuerySlotBasedExecutionHashAggApproxMemoryUseInBytesBeforeSpill: 104857600
}).was
Long('1')
Conclusion
For aggregations, especially when documents are large, using a covering index can greatly improve performance by allowing MongoDB to read compact index entries instead of full documents. Because MongoDB’s query planner does not automatically pick covering indexes for these workloads, you typically need a hint to force their use, even when there is no filter or sort. The most important factor is ensuring the index covers the fields used by the $group stage, as this eliminates unnecessary document fetches. When enough memory is available, the order of fields in the index has little impact. However, under memory pressure that causes hash aggregation to spill to disk, particularly with a high number of groups, an index that starts with the grouping field can provide additional performance benefits by processing groups in order and reducing repeated spills.
I used $sortByCount to demonstrate this point, but the principle also applies when using $group with the $sum accumulator, since $sortByCount internally uses {$sum: 1}. For the $first or $last accumulators, MongoDB takes advantage of the index automatically, without needing a hint, because it behaves like a filter, retrieving the first or last entry from the index for each group.
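To close, here is a toy model of the idea behind DISTINCT_SCAN (my illustration in plain JavaScript, not MongoDB's implementation): over a sorted key list, the scan reads the first entry of each distinct key and then seeks past its duplicates, instead of examining every entry:

```javascript
// Binary search for the first position whose key is strictly
// greater than `key` (the start of the next distinct key).
function seekPast(sorted, key, lo) {
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] <= key) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Toy distinct scan: examine one entry per distinct key, then
// seek past its duplicates rather than reading them all.
function distinctScan(sorted) {
  const firsts = [];
  let i = 0, examined = 0;
  while (i < sorted.length) {
    examined++;                     // read the first entry of the group
    firsts.push(sorted[i]);
    i = seekPast(sorted, sorted[i], i + 1);
  }
  return { firsts, examined };
}

const keys = [1, 1, 1, 2, 2, 3, 3, 3, 3];
console.log(distinctScan(keys));
// { firsts: [ 1, 2, 3 ], examined: 3 }  — versus 9 for a full scan
```

This is why $first and $last can be served directly from the index without scanning all 50,000 entries, and without a hint.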
