Prerequisites
🚀 Feature Proposal
I'm opening this issue as a place to discuss potential improvements for limit and perDocumentLimit with populate, thoughts may not be very organized since we have so many options.
Mongoose current behavior
Given the following posts when populating comments with limit 2
[
{
title: 'Post 1',
commentsIds: ['1', '2', '3', '4', '5']
},
{
title: 'Post 2',
commentsIds: ['6', '7', '8', '9', '10']
},
]
Currently if we use limit mongoose sends a query on comments with limit of 4 (posts.length * limit) with the following ids: 1,2,3,4,5,6,7,8,9,10, which results in getting back 4 ids relying on natural order, meaning that even though both posts have comments, there's no guarantee they'll both get 2 comments populated, because Mongo sends back 1 to 4 when tested (we shouldn't rely on that fact since natural order can change).
This is explained in the docs: limit vs. perDocumentLimit.
Considerations for potential improvements
- When populating documents like
commentsIds above, there's no guarantee that they refer to documents that exist in the comments collection. From my experience, most people never hard-delete documents (not without cleaning up their references).
- We don't necessarily need a single limit/perDocumentLimit algorithm to handle all scenarios, we could find optimizations that are reliable when certain conditions are met, this would be better for performance, but adds to the maintenance cost of the project.
- For most of these optimizations, when users pass
sort, or match inside the populate object, those optimizations would be invalidated and we'd fallback to default/current behavior, at least in the initial phase.
- If we decide to proceed with any of those
Optimization #1 for perDocumentLimit
Instead of always sending a query per document in perDocumentLimit, there is a scenario when allCommentsIds.length <= posts.length * perDocumentLimit. In that case, sending a single find query would guarantee we'd get back all the comments. The implementation for this change can be as simple as adding those two lines in a condition somewhere
opts.limit = opts.perDocumentLimit;
delete opts.perDocumentLimit;
Optimization #2 for limit
- This is not really a performance optimization, but a behavioral optimization. The main issue with the current
limit implementation is that it's not guaranteed to return limit populated docs for each parent doc even if it has them. The reason is that we send all the subdocuments ids to MongoDB.
- If we assume that in most of the cases people don't hard-delete documents and the references are always valid, we can make the default behavior send the first
limit ids. Per document. For the data above, with limit: 2 we would call Comment.find({ _id: { $in: [1, 2, 6, 7] } });
- That way we only find two comments per post, assuming they're not hard-deleted, and if we find out that some documents are hard-deleted
foundDocs.length < ids.length, we can just send another query with the remaining ids, and accept the drawback of an additional query. The point is that this should rarely happen, assuming that most people don't hard-delete data.
- In addition to using this with
limit, this optimization can be introduced as an alternative way to do perDocumentLimit. Current perDocumentLimit sends N queries, if users don't hard-delete data, this approach should fetch all data in 1 query, and worst case scenario would be N queries.
- An alternative approach to sending a query with all ids with
limit = docs.length * opts.limit, is to use aggregation pipeline that prioritizes ids and does not rely on natural order. This would behave better than the current limit, with the guarantee that hard-deleted documents won't cause issues. Can also be used if users pass custom match and sort, because we can just pass them as extra stages in the pipeline. This would work by getting the first 2 comments ids from each post, and then rotating comments ids from each post: 1,2,6,7,3,8,4,9,5,10:
// `limit` ids from each doc, then rotate an id from each doc.
const ids = [1, 2, 6, 7, 3, 8, 4, 9, 5, 10];
const comments = await Comment.aggregate([
{ $match: { _id: { $in: ids } } },
{ $addFields: { __order: { $indexOfArray: [ids, '$_id'] } } },
{ $sort: { __order: 1 } },
{ $limit: 4 },
{ $project: { __order: 0 } }
]);
This overrides MongoDB natural order behavior, the downside is can be ~50% slower than a normal find but is more reliable (but not perfect, since if some comments have invalid references we might end up with ids being used up by a previous post).
We can use other aggregation pipelines that would give guaranteed results similar to perDocumentLimit but I would need to explore them in further detail.
For now I'm really looking to see what people think about this, whether they think it's an important improvement and worth the effort/maintenance cost or not.
I can see at least Optimization #1 being a simple enough improvement that has a potential for implementation.
In the process of tuning the performance, I created this benchmark that outputs benchmark results in both JSON and nice HTML page, would be helpful if anyone decides to try different approaches.
What do you think? @vkarpov15 @hasezoey
Prerequisites
🚀 Feature Proposal
I'm opening this issue as a place to discuss potential improvements for
limitandperDocumentLimitwithpopulate, thoughts may not be very organized since we have so many options.Mongoose current behavior
Given the following
postswhen populatingcommentswithlimit2Currently if we use
limitmongoose sends a query oncommentswithlimitof4(posts.length * limit) with the following ids:1,2,3,4,5,6,7,8,9,10, which results in getting back 4 ids relying on natural order, meaning that even though both posts have comments, there's no guarantee they'll both get 2 comments populated, because Mongo sends back1to4when tested (we shouldn't rely on that fact since natural order can change).This is explained in the docs:
limitvs.perDocumentLimit.Considerations for potential improvements
commentsIdsabove, there's no guarantee that they refer to documents that exist in thecommentscollection. From my experience, most people never hard-delete documents (not without cleaning up their references).sort, ormatchinside thepopulateobject, those optimizations would be invalidated and we'd fallback to default/current behavior, at least in the initial phase.Optimization #1 for
perDocumentLimitInstead of always sending a query per document in
perDocumentLimit, there is a scenario whenallCommentsIds.length <= posts.length * perDocumentLimit. In that case, sending a single find query would guarantee we'd get back all the comments. The implementation for this change can be as simple as adding those two lines in a condition somewhereOptimization #2 for
limitlimitimplementation is that it's not guaranteed to returnlimitpopulated docs for each parent doc even if it has them. The reason is that we send all the subdocuments ids to MongoDB.limitids. Per document. For the data above, withlimit: 2we would callComment.find({ _id: { $in: [1, 2, 6, 7] } });foundDocs.length < ids.length, we can just send another query with the remaining ids, and accept the drawback of an additional query. The point is that this should rarely happen, assuming that most people don't hard-delete data.limit, this optimization can be introduced as an alternative way to doperDocumentLimit. CurrentperDocumentLimitsends N queries, if users don't hard-delete data, this approach should fetch all data in 1 query, and worst case scenario would be N queries.limit = docs.length * opts.limit, is to use aggregation pipeline that prioritizes ids and does not rely on natural order. This would behave better than the currentlimit, with the guarantee that hard-deleted documents won't cause issues. Can also be used if users pass custommatchandsort, because we can just pass them as extra stages in the pipeline. This would work by getting the first2comments ids from each post, and then rotating comments ids from each post:1,2,6,7,3,8,4,9,5,10:This overrides MongoDB natural order behavior, the downside is can be ~50% slower than a normal
findbut is more reliable (but not perfect, since if some comments have invalid references we might end up with ids being used up by a previous post).We can use other aggregation pipelines that would give guaranteed results similar to
perDocumentLimitbut I would need to explore them in further detail.For now I'm really looking to see what people think about this, whether they think it's an important improvement and worth the effort/maintenance cost or not.
I can see at least Optimization #1 being a simple enough improvement that has a potential for implementation.
In the process of tuning the performance, I created this benchmark that outputs benchmark results in both JSON and nice HTML page, would be helpful if anyone decides to try different approaches.
What do you think? @vkarpov15 @hasezoey