Aggregation pipeline stages
Aggregation pipeline is an array of aggregation methods that process data stage-by-stage until a result is returned at the end of the pipeline.
Aggregation pipeline stages for a collection is executed by the method:
db.collection.aggregate()
Stages are passed into the aggregate method in this syntax:
db.collection.aggregate( [ { <stage> }, ... ] )
Most stages can be repeated in a pipeline, except these stages: $out
, $merge
, and $geoNear
.
There are over 30 built-in stages available for collection aggregation. We will be focusing on those stages that may help us answer the questions set in the previous chapter.
A list of all the stages:
https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
Stages to be learned, and their functions according to MongoDB documentation:
$match
:
Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. $match uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).$group
:
Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.$count
:
Returns a count of the number of documents at this stage of the aggregation pipeline.$limit
:
Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).$skip
:
Skips the first n documents where n is the specified skip number and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (if after the first n documents).
Aggregation pipeline operators
Aggregation pipeline operators are used inside aggregation pipeline stages. They can perform mathematical, logical, and formatting operations in a stage.
Accumulator operators
Accumulator operators are executed inside some pipeline operators to perform actions on accumulated documents. Some common operators:
$avg
$max
$min
$sum
$multiply
These accumulator operators will be demonstrated in the following pipeline stages.
$match and $count stage
Count all the documents at Austin store:
> db.sales.aggregate([
{
$match: { storeLocation: "Austin" }
},
{
$count: "counter"
}
])
Output:
{ "counter" : 676 }
In the above example, the $match
stage gather all the documents that match the set condition and pass the results to $count
stage. $count
stage counts the number of document from the previous stage, and output the result to the variable counter.
For example, to get the sum of all the notepad that has been sold at Austin store:
$group stage
$group
stage groups documents by a field, and then optionally perform accumulator operators.
For example, to find the average satisfaction score at Austin:
> db.sales.aggregate([
{
$match: {
storeLocation: "Austin"
}
},
{
$group: {
_id: "",
average_satisfaction: {
$avg: "$customer.satisfaction"
}
}
}
])
Output:
{ "_id" : "", "average_satisfaction" : 3.7677514792899407 }
To find all the average satisfaction score by store:
> db.sales.aggregate([
{
$match: {}
},
{
$group: {
_id: "$storeLocation",
average_satisfaction: {
$avg: "$customer.satisfaction"
}
}
}
])
Output:
{ "_id" : "Austin", "average_satisfaction" : 3.7677514792899407 }
{ "_id" : "New York", "average_satisfaction" : 3.802395209580838 }
{ "_id" : "San Diego", "average_satisfaction" : 3.7890173410404624 }
{ "_id" : "London", "average_satisfaction" : 3.802267002518892 }
{ "_id" : "Denver", "average_satisfaction" : 3.7998708844415754 }
{ "_id" : "Seattle", "average_satisfaction" : 3.7901234567901234 }
Ordering of outputs generated by an aggregation pipeline
From the above example, we can see that the results are not naturally ordered. It is most likely due to the parallel nature of some pipelines stages. To sort the output, we need to use the $sort
pipeline.
$sort and $limit stage
These two stages accept parameters similar to their collection method version.
To find the store with the highest average satisfaction score:
> db.sales.aggregate([
{
$match: {}
},
{
$group: {
_id: "$storeLocation",
average_satisfaction: {
$avg: "$customer.satisfaction"
}
}
},
{
$sort: {
average_satisfaction: -1
}
},
{
$limit: 1
}
])
Output:
{ "_id" : "New York", "average_satisfaction" : 3.802395209580838 }
$project stage
Similar to the project()
method for collection, $project
stage allows me to select/deselect fields in an output.
Syntax:
$project: {
<field>: <1 or 0>,
<field>: <1 or 0>,
...
}
$unwind stage
$unwind
stage deconstructs an array in a document and outputs a document for each element in the array.
For example, if we have a document as following:
{
name: "Bob",
nicknames: ["Bobby", "Handsome", "Rich"]
}
Assuming that is the only document in the collection people, then the $unwind
stage will output:
> db.people.aggregate([ { $unwind: "$nicknames" } ])
{
name: "Bob",
nicknames: "Bobby"
},
{
name: "Bob",
nicknames: "Handsome"
},
{
name: "Bob",
nicknames: "Rich"
}
More accumulator operators
To find the total value of an order, we need to calculate the sum of all items. For each item, we need to calculate the product of the price and the quantity.
Find the total revenue of all orders from all store locations:
> db.sales.aggregate([
{
$match: {}
},
{
$project: {
items: 1
}
},
{
$unwind: "$items"
},
{
$group: {
_id: "",
total_revenue: {
$sum : {
$multiply: ["$items.price", "$items.quantity"]
}
}
}
},
])
Output:
{ "_id" : "", "total_revenue" : NumberDecimal("9894572.27") }
// Total revenue from all stores = $9894572.27
Breaking down the above aggregation pipeline:
$match
stage: Get all documents in sales collection.$project
stage: Only include the items field for each document.$unwind
stage: Deconstruct the array, items, and put them into separate documents.$group
stage:_id
is null, so all documents are treated as the same group.total_revenue
is a custom output field name$sum
gets the sum of all the values from the next operator$multiply
calculates the product of two fields: items.price and items.quantity
Answering the questions
How much revenue did we make in a year?
> db.sales.aggregate([
{
$match: {
saleDate: {
$gte: ISODate("2015-01-01T00:00:00.000Z"),
$lte: ISODate("2015-12-31T23:59:59.999Z"),
}
}
},
{
$project: {
items: 1
}
},
{
$unwind: "$items"
},
{
$group: {
_id: "",
total_revenue: {
$sum : {
$multiply: ["$items.price", "$items.quantity"]
}
}
}
},
])
Output:
{ "_id" : "", "total_revenue" : NumberDecimal("1979871.31") }
// In year 2015, we made $1979871.31 in total.
Group the results by month
> db.sales.aggregate([
{
$match: {
saleDate: {
$gte: ISODate("2015-01-01T00:00:00.000Z"),
$lte: ISODate("2015-12-31T23:59:59.999Z"),
}
}
},
{
$project: {
saleDate: 1,
items: 1
}
},
{
$unwind: "$items"
},
{
$group: {
_id: {
$month: "$saleDate"
},
monthly_revenue: {
$sum : {
$multiply: ["$items.price", "$items.quantity"]
}
}
}
},
{
$sort: {
_id: 1
}
}
])
Output:
{ "_id" : 1, "monthly_revenue" : NumberDecimal("222577.01") }
{ "_id" : 2, "monthly_revenue" : NumberDecimal("189776.15") }
{ "_id" : 3, "monthly_revenue" : NumberDecimal("178086.94") }
{ "_id" : 4, "monthly_revenue" : NumberDecimal("132298.50") }
{ "_id" : 5, "monthly_revenue" : NumberDecimal("167482.69") }
{ "_id" : 6, "monthly_revenue" : NumberDecimal("149484.34") }
{ "_id" : 7, "monthly_revenue" : NumberDecimal("149510.47") }
{ "_id" : 8, "monthly_revenue" : NumberDecimal("154868.63") }
{ "_id" : 9, "monthly_revenue" : NumberDecimal("135615.80") }
{ "_id" : 10, "monthly_revenue" : NumberDecimal("175215.34") }
{ "_id" : 11, "monthly_revenue" : NumberDecimal("158390.42") }
{ "_id" : 12, "monthly_revenue" : NumberDecimal("166565.02") }
- The
$month
operator obtain the number of month from the ISODate object.
Which item is the most/least popular?
> db.sales.aggregate([
{
$project: {
items: 1
}
},
{
$unwind: "$items"
},
{
$group: {
_id: "$items.name",
total_sold: {
$sum : "$items.quantity"
}
}
},
{
$sort: {
total_sold: -1
}
}
])
Output:
{ "_id" : "binder", "total_sold" : 25493 }
{ "_id" : "envelopes", "total_sold" : 25078 }
{ "_id" : "notepad", "total_sold" : 20727 }
{ "_id" : "pens", "total_sold" : 13766 }
{ "_id" : "printer paper", "total_sold" : 12092 }
{ "_id" : "backpack", "total_sold" : 6918 }
{ "_id" : "laptop", "total_sold" : 6793 }
Binder is the most popular item with 25493 units sold.
Envelopes and notepad are the two next popular items.
The laptop is the worst in terms of unit sales, with only 6793 units sold.
Which item made the most/least revenue?
> db.sales.aggregate([
{
$project: {
items: 1
}
},
{
$unwind: "$items"
},
{
$group: {
_id: "$items.name",
total_revenue: {
$sum : {
$multiply: [ "$items.price", "$items.quantity"]
}
}
}
},
{
$sort: {
total_revenue: -1
}
}
])
Output:
{ "_id" : "laptop", "total_revenue" : NumberDecimal("6775977.07") }
{ "_id" : "backpack", "total_revenue" : NumberDecimal("817374.10") }
{ "_id" : "pens", "total_revenue" : NumberDecimal("581843.27") }
{ "_id" : "binder", "total_revenue" : NumberDecimal("511644.57") }
{ "_id" : "notepad", "total_revenue" : NumberDecimal("463615.48") }
{ "_id" : "envelopes", "total_revenue" : NumberDecimal("376658.49") }
{ "_id" : "printer paper", "total_revenue" : NumberDecimal("367459.29") }
We can see that laptops made the most money, netting $6,775,977.07.
The second most grossing item is the backpack, totalling $817,374.10.
The item making the least amount of money is printer paper, counting $367,459.29.
Which item category made the most/least revenue?
> db.sales.aggregate([
{
$project: {
items: 1
}
},
{
$unwind: "$items"
},
{
$unwind: "$items.tags"
},
{
$group: {
_id: "$items.tags",
total_revenue: {
$sum : {
$multiply: [ "$items.price", "$items.quantity"]
}
}
}
},
{
$sort: {
total_revenue: -1
}
}
])
Output:
{ "_id" : "school", "total_revenue" : NumberDecimal("9150454.49") }
{ "_id" : "office", "total_revenue" : NumberDecimal("8565553.60") }
{ "_id" : "electronics", "total_revenue" : NumberDecimal("6775977.07") }
{ "_id" : "stationary", "total_revenue" : NumberDecimal("1325961.05") }
{ "_id" : "writing", "total_revenue" : NumberDecimal("1045458.75") }
{ "_id" : "general", "total_revenue" : NumberDecimal("888303.06") }
{ "_id" : "kids", "total_revenue" : NumberDecimal("817374.10") }
{ "_id" : "travel", "total_revenue" : NumberDecimal("817374.10") }
{ "_id" : "organization", "total_revenue" : NumberDecimal("511644.57") }
The category with the most revenue is “school”, at $9,150,454.49.
The least grossing category is “organization”, at $511,644.57.
Is there any notable pattern in the customer group?
One way of trying to find patterns in customers is to group them by ages.
For example:
- 0 - 17
- 18 - 29
- 30 - 49
- 50 - 69
- 70 and above
Finding the number of customers at each age
> db.sales.aggregate([
{
$project: {
customer: 1
}
},
{
$group: {
_id: "$customer.email",
age: {
$last: "$customer.age"
},
gender: {
$last: "$customer.gender"
}
}
},
{
$group: {
_id: "$age",
count: {
$sum: 1
}
}
},
{
$sort: {
_id: 1
}
}
])
Output:
{ "_id" : 16, "count" : 42 }
{ "_id" : 17, "count" : 38 }
{ "_id" : 18, "count" : 40 }
{ "_id" : 19, "count" : 48 }
{ "_id" : 20, "count" : 53 }
{ "_id" : 21, "count" : 49 }
{ "_id" : 22, "count" : 39 }
{ "_id" : 23, "count" : 48 }
{ "_id" : 24, "count" : 44 }
{ "_id" : 25, "count" : 93 }
{ "_id" : 26, "count" : 76 }
{ "_id" : 27, "count" : 81 }
{ "_id" : 28, "count" : 65 }
{ "_id" : 29, "count" : 92 }
{ "_id" : 30, "count" : 73 }
{ "_id" : 31, "count" : 97 }
{ "_id" : 32, "count" : 80 }
{ "_id" : 33, "count" : 110 }
{ "_id" : 34, "count" : 138 }
{ "_id" : 35, "count" : 131 }
Type "it" for more
There are customers with age more than 35, but there may be too many to show on a screen.
Note that the customers are first grouped by email to remove duplicates.
We can run the aggregation above again, but with descending sort instead:
Output:
{ "_id" : 75, "count" : 37 }
{ "_id" : 74, "count" : 34 }
{ "_id" : 73, "count" : 41 }
{ "_id" : 72, "count" : 46 }
{ "_id" : 71, "count" : 44 }
{ "_id" : 70, "count" : 61 }
{ "_id" : 69, "count" : 39 }
{ "_id" : 68, "count" : 41 }
{ "_id" : 67, "count" : 46 }
{ "_id" : 66, "count" : 48 }
{ "_id" : 65, "count" : 83 }
{ "_id" : 64, "count" : 71 }
{ "_id" : 63, "count" : 77 }
{ "_id" : 62, "count" : 84 }
{ "_id" : 61, "count" : 87 }
{ "_id" : 60, "count" : 93 }
{ "_id" : 59, "count" : 104 }
{ "_id" : 58, "count" : 82 }
{ "_id" : 57, "count" : 96 }
{ "_id" : 56, "count" : 84 }
So, we can see that the youngest customer is 16 years old, while the oldest is 75 years old.
Finding the number of customers at each age group
From 16 to 29:
> db.sales.aggregate([
{
$match: {
"customer.age": {
$gte:50,
$lte: 75
}
}
},
{
$project: {
customer: 1
}
},
{
$group: {
_id: "$customer.email",
age: {
$last: "$customer.age"
},
gender: {
$last: "$customer.gender"
}
}
},
{
$group: {
_id: "",
count: {
$sum: 1
}
}
}
])
Output:
{ "_id" : "", "count" : 808 }
// There are 808 unique customers betwenn 16-29 years old.
Repeat the aggregation with the remaining age groups:
{ "_id" : "", "count" : 2346 }
// 2346 unique customers between 30-49 years old.
{ "_id" : "", "count" : 1846 }
// 1846 unique customers between 50-75 years old.
To obtain the count of customers in age groups with one aggregation pipeline:
> db.sales.aggregate([
{
$project: {
customer: 1
}
},
{
$group: {
_id: {
$multiply: [ 20, {
$floor: {
$divide: [ "$customer.age", 20]
}
}]
},
count: {
$sum: 1
}
}
},
{
$sort: {
_id: 1
}
}
])
Output:
{ "_id" : 0, "count" : 168 } // 0 - 19 yo
{ "_id" : 20, "count" : 1788 } // 20 - 39 yo
{ "_id" : 40, "count" : 2112 } // 40 - 59 yo
{ "_id" : 60, "count" : 932 } // 60 - 79 yo
We can see that most customers are in the age group of 40-59 years old. There are only 168 customers less than 20 years old.
Group documents by range
$bucket stage
We have used $group
stage to group documents by a property. But, if we want to group documents with a range, like years between 2014-2015, we can use the $bucket
stage.
To find the number of sales documents between 2013-2014 and 2015-2016:
db.sales.aggregate([
{
$match: {}
},
{
$bucket: {
groupBy: { $year: "$saleDate" },
boundaries: [2013, 2015, 2017],
default: "Other",
output: {
"count": { $sum: 1},
}
}
}
])
Output:
[
{ _id: 2013, count: 1932 },
{ _id: 2015, count: 2016 },
{ _id: 'Other', count: 1052 }
]
So, between 2013-2014, there were 1932 transactions. Between 2015-2016, there were 2016 transactions.
To find the total revenue between 2013-2014 and 2015-2016:
db.sales.aggregate([
{
$match: {}
},
{
$unwind: "$items"
},
{
$bucket: {
groupBy: { $year: "$saleDate" },
boundaries: [2013, 2015, 2017],
default: "Other",
output: {
"totalRevenue": {
$sum: {
$multiply: ["$items.price", "$items.quantity"]
}
}
}
}
}
])
Output:
[
{ _id: 2013, totalRevenue: Decimal128("3794028.20") },
{ _id: 2015, totalRevenue: Decimal128("3914691.17") },
{ _id: 'Other', totalRevenue: Decimal128("2185852.90") }
]
- Between 2013-2014, we made $3,794,028.20
- Between 2015-2016, we made $3,914,691.17
- “Other” represents sales made before 2013, or in 2017 or later.
$bucketAuto stage
We can let MongoDB decide how to divide documents into buckets. Instead of specifying boundries, we simply decide how many buckets should be maximumly divided into.
Example:
db.sales.aggregate([
{
$match: {}
},
{
$bucketAuto: {
groupBy: { $year: "$saleDate" },
buckets: 3,
output: {
"count": { $sum: 1},
}
}
}
])
Output:
[
{ _id: { min: 2013, max: 2015 }, count: 1932 },
{ _id: { min: 2015, max: 2017 }, count: 2016 },
{ _id: { min: 2017, max: 2017 }, count: 1052 }
]
Note that min:
is inclusive, while max:
is exclusive.