|
20 | 20 | "- Label drift: $P(X | Y) = P_{ref}(X | Y)$ but $P(Y) \\neq P_{ref}(Y)$,\n",
|
21 | 21 | "- Concept drift: $P(Y | X) \\neq P_{ref}P(Y | X)$ (this differs from S. Zhao et al.).\n",
|
22 | 22 | "\n",
|
23 |
| - "These are not exclusive, e.g. when the class balance changes without the class-conditional distributions changing, you would observe both covarite and label shift.\n", |
| 23 | + "These are not exclusive. For example, suppose we have a classifier between cats and dogs and now the class balance changes so that more inputs are actually dogs even though the cats and dogs by themselves look the same (in fancy-speak they have the same class-conditional distribution). By definition that is label drift. But it also is covariate drift because the more dog pictures we are seeing means that the input distribution has changed.\n", |
24 | 24 | "\n",
|
25 |
| - "There are several things we would want to do that are related to drift:\n", |
26 |
| - "- In _domain adaptation_ we try to make models which can cope with the new distribution $P$.\n", |
27 |
| - "- In _drift detection_ (here) we are interested in detecting whether drift has happened. It will be hard to deal with concept drift because we typically do not have access to $Y$ from $P(Y)$, but we will see what can be done about the others.\n", |
| 25 | + "There are several things we would want to do that are related to drift: In _domain adaptation_ we try to make models which can cope with the new distribution $P$. In _drift detection_ (here) we are interested in detecting whether drift has happened. It will be hard to deal with concept drift because we typically do not have access to $Y$ from $P(Y)$, but we will see what can be done about the others.\n", |
28 | 26 | "\n",
|
29 | 27 | "In contrast to the drift detection we are concerned with, [outlier detection](https://en.wikipedia.org/wiki/Anomaly_detection) mainly investigates a single datapoint. The assessment then is whether it the observed datapoint is exceedingly improbable. The presence of outliers (in unexpected quantities) indicates a distribution shift, but the distribution shift may also consider an unexpected \"narrowing\" of the observation - in the extreme you may imagine suddenly only seeing the same perfectly normal example over and over again, this would not be an outlier, but certainly a drift. This assessment is possible because we work with a multiple samples from the test distribution at once. In fact, many of the statistical testing methods we apply below are essentially symmetric in the reference and the tested distribution.\n",
|
30 | 28 | "\n",
|
|
33 | 31 | "When framed as above, drift detection is the question if $P(X,Y) \\neq P_{ref}(X,Y)$. Given our setting, that we observe samples from $P(X,Y)$ or $P(X)$ after previously observing samples from $P_{ref}(X,Y)$, it is natural to look at statistical hypothsis for with the null hypothesis $P(X, Y) = P_{ref}(X,Y)$.\n",
|
34 | 32 | "\n",
|
35 | 33 | "But there are several difficulties:\n",
|
| 34 | + "\n", |
36 | 35 | "- We typically do not have $Y$ (e.g. the ground truth label) from $P$.\n",
|
37 | 36 | "- The inputs $X$ typically are very high-dimensional - e.g. a 224x224 RGB picture is a 150'000 dimensional input.\n",
|
38 | 37 | " \n",
|
|
41 | 40 | "So $X$ is all we have, and that is too high-dimensional to work with directly. But this means that we have to reduce the dimensionality. \n",
|
42 | 41 | "\n",
|
43 | 42 | "In summary,\n",
|
| 43 | + "\n", |
44 | 44 | "- We typically need to work with $X$. We assume that we have a (sizeable) sample from $P_{ref}(X)$ and also get a sample from $P(X)$ to decide with.\n",
|
45 | 45 | "- We need a way to reduce the dimensionality.\n",
|
46 | 46 | "- We need a statistical test for $P = P_{ref}$.\n",
|
|
49 | 49 | "\n",
|
50 | 50 | "## Dimension reduction\n",
|
51 | 51 | "\n",
|
52 |
| - "There are two main ways to do do the dimension reduction.\n", |
| 52 | + "There are two main ways to do the dimension reduction.\n", |
53 | 53 | "\n",
|
54 | 54 | "An obvious choice might be to turn to estabilished methods for dimension reduction. There are plenty of them, the simplest is perhaps the principal component analysis (PCA). And indeed this is one of the routes we will take.\n",
|
55 | 55 | "\n",
|
|
235 | 235 | "source": [
|
236 | 236 | "Here, the model components are in boxes and the input, output and intermediate values are in ellipses.\n",
|
237 | 237 | "\n",
|
238 |
| - "If $X$ at the very top is not a good candidate, so we might use any other - the features, the scores, the class probabiities or (at least we might conceptually) the predicted class.\n", |
| 238 | + "If $X$ at the very top is not a good candidate, so we might use any other - the features, the scores, the class probabities or (at least we might conceptually) the predicted class.\n", |
239 | 239 | "\n",
|
240 | 240 | "But we could also avoid using the model $m$ for our task and replace it with another, e.g. a feature extractor trained in a self-supervised fashion or an a different task like ImageNet. One such auxilliary type of models that has been used are autoencoders.\n",
|
241 | 241 | "\n",
|
|
261 | 261 | "We already see two important parameters here: The number $N$ of sample points from the reference and the number $M$ of sample points to be tested. As a rule of thumb, more samples help in more reliable testing, but we need to balance this with the compute requirement and, perhaps even more importantly, the time and cost of data acquisition. For the test data in particular, our desire to for timely monitoring may limit how many samples we want to wait for.\n",
|
262 | 262 | "\n",
|
263 | 263 | "Now we have two types of tests:\n",
|
| 264 | + "\n", |
264 | 265 | "- Some tests, such as the maximum mean discrepancy test ([A. Getton et al.: A Kernel Two-Sample Test, JMLR 2012](https://jmlr.csail.mit.edu/papers/v13/gretton12a.html)) can be directly applied on the data for any $d$, even if\n",
|
265 | 266 | " large $d$ is undesirable.\n",
|
266 | 267 | "- The most classical tests like the two-sample Kolmogorov-Smirnov test, are for one-dimensional data only.\n",
|
|
271 | 272 | " \n",
|
272 | 273 | " This loses power: First, the distribution might dramatically change while the marginals stay the same, and we would have no way to detect this change on the marginals. Secondly, we now make $d$ tests and if we wish to compute p-values, we need to [adjust for this](https://en.wikipedia.org/wiki/Multiple_comparisons_problem). The (computationally, but not necessarily philosophically) simplest adjustment is a [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction), where we divide the significance level (the p-value) by $d$ and check if any test meets this harder significance criterion.\n",
|
273 | 274 | "\n",
|
274 |
| - "As the p-value gives us the expected rate of \"false alarms\" if the null hypotheses _no drift_ remains valid, just leaning on it does leave something to be desired in terms of calibrating detection rates. In many practical applications, we may not have samples of drifted data, so we have to make to with the p-value only. Typically we would choose rather low p-values. If we do have access to (potentially fabricated) samples from the drifted set, we can extend our analysis to consider the [receiver operating characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), so we can trade off the false positive rate against the statistical sensitivity of the detector. Examples of this analysis are provided in [drift detection on images](./drift_detection_on_images.ipynb).\n", |
| 275 | + "As the p-value gives us the expected rate of \"false alarms\" if the null hypotheses _no drift_ remains valid, just leaning on it does leave something to be desired in terms of calibrating detection rates. In many practical applications, we may not have samples of drifted data, so we have to make do with the p-value only. Typically we would choose rather low p-values. If we do have access to (potentially fabricated) samples from the drifted set, we can extend our analysis to consider the [receiver operating characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), so we can trade off the false positive rate against the statistical sensitivity of the detector. Examples of this analysis are provided in [drift detection on images](./drift_detection_on_images.ipynb).\n", |
275 | 276 | "\n",
|
276 | 277 | "We will discuss individual tests in [Comparing Drift Detectors](./comparing_drift_detectors.ipynb).\n",
|
277 | 278 | "\n",
|
278 | 279 | "## Drift detection as classification\n",
|
279 | 280 | "\n",
|
280 |
| - "Another approach to the dirft detection problem can be to try to classify samples into coming from $P_{ref}$ or $P$, respectively. If we cannot train a classifier that works better than random, we may conclude that $P_{ref}$ and $P$ are indistinguishable. This may sound very familiar, because it also is at the core of the Generative Adversarial Network\n", |
| 281 | + "Another approach to the drift detection problem can be to try to classify samples into coming from $P_{ref}$ or $P$, respectively. If we cannot train a classifier that works better than random, we may conclude that $P_{ref}$ and $P$ are indistinguishable. This may sound very familiar, because it also is at the core of the Generative Adversarial Networks. It has also been used with great success for domain adaptation, where many methods use a train domain classifier as an auxiliary task and then minimize the ability to distinguish between source and target from the features used in the main task.\n", |
281 | 282 | "\n",
|
282 | 283 | "To make this operational, we can get out our toolbox of classifiers, e.g. Neural Networks and Nearest-Neighbor ones, see [D. Lopez-Paz, M. Oquab: Revisiting classifier two-sample tests, ICLR 2017](https://arxiv.org/abs/1610.06545). Note that this approach can be data-intensive: To execute, we need to split the samples $x^{ref}_i$ and $x_i$ into train and test samples. When using neural networks, we also need to train the classifier, adding computational requirements. When we have enough data and time, we may hope that such a classification-based approach may be highly effective.\n"
|
283 | 284 | ]
|
|