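Statistics Series
Starting in 2013, Nature Methods began publishing a series of statistics columns called "Points of Significance". I paid no attention to the series at the time (that year I cared more about my Dota ladder rating). In 2017 I happened to come across it again, and my first thought was: why would Nature Methods publish something like this? It felt like finding a plate of scallion-tossed tofu in the middle of a Manchu-Han Imperial Feast. Only when I started reading the series carefully did I realize that this plate of scallion-tossed tofu was reshaping my whole idea of good food. So many things I had dismissed as too simple to notice turned out to be extremely important. That is what prompted me to translate the series, in the hope that people like me can get something out of it.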
Importance of being uncertain - How samples are used to estimate population statistics and what this means in terms of uncertainty.
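Chinese translation: 不确定的重要性 (The importance of uncertainty) - The uncertainty involved in estimating population statistics from samples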
Error bars - The use of error bars to represent uncertainty and advice on how to interpret them.
Significance, P values and t-tests - Introduction to the concept of statistical significance and the one-sample t-test.
Power and sample size - Use of statistical power to optimize study design and sample numbers.
Visualizing samples with box plots - Introduction to box plots and their use to illustrate the spread and differences of samples.
Comparing samples—part I - How to use the two-sample t-test to compare either uncorrelated or correlated samples.
Comparing samples—part II - Adjustment and reinterpretation of P values when large numbers of tests are performed.
Nonparametric tests - Use of nonparametric tests to robustly compare skewed or ranked data.
Designing comparative experiments - The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.
Analysis of variance and blocking - Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.
Replication - Technical replication reveals technical variation while biological replication is required for biological inference.
Nested designs - Use the relative noise contribution of each layer in nested experimental designs to optimally allocate experimental resources using ANOVA.
Two-factor designs - It is common in biological systems for multiple experimental factors to produce interacting effects on a system. A study design that allows these interactions can increase sensitivity.
Sources of variation - To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication for replicable and meaningful results.
Split plot design - When some experimental factors are harder to vary than others, a split plot design can be efficient for exploring the main (average) effects and interactions of the factors.
Bayes’ theorem - Use Bayes’ theorem to combine prior knowledge with observations of a system and make predictions about it.
Bayesian statistics - Unlike classical frequentist statistics, Bayesian statistics allows direct inference of the probability that a model is correct and it provides the ability to update this probability as new data is collected.
Sampling distributions and the bootstrap - Use the bootstrap method to simulate new samples and assess the precision and bias of sample estimates.
Bayesian networks - Model interactions between causes and effects in large networks of causal influences using Bayesian networks, which combine network analysis with Bayesian statistics.
Association, correlation and causation - Pairwise dependencies can be characterized using correlation but be aware that correlation only implies association, not causation. Conversely, causation implies association, not correlation.
Simple linear regression - Given data on the relationship between two variables, linear regression is a simple and surprisingly robust method to predict unknown values.
Multiple linear regression - When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.
Analyzing outliers: influential or nuisance - Some outliers influence the regression fit more than others.
Regression diagnostics - Residual plots can be used to validate assumptions about the regression model.
Logistic regression - Regression can be used on categorical responses to estimate probabilities and to classify.
Classification evaluation - It is important to understand both what a classification metric expresses and what it hides.
Model selection and overfitting - "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." - John von Neumann
Regularization - Constraining the magnitude of parameters of a model can control its complexity.
P values and the search for significance - "Little P value / What are you trying to say / Of significance?" - Steve Ziliak
Interpreting P values - A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis.
Tabular data - Tabulating the number of objects in categories of interest dates back to the earliest records of commerce and population censuses.
Clustering - Clustering finds patterns in data, whether they are there or not.
Principal component analysis - PCA helps you interpret your data, but it will not always find the important patterns.
Classification and regression trees - Decision trees are a simple but powerful prediction method.
Ensemble methods: bagging and random forests - Many heads are better than one.
Machine learning: a primer - Machine learning extracts patterns from data without explicit instructions.
Machine learning: supervised methods - Supervised learning algorithms extract general principles from observed examples guided by a specific prediction objective.
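To give a taste of what the columns cover, here is a minimal sketch of the idea behind the "Sampling distributions and the bootstrap" entry above: resample the observed data with replacement many times and use the spread of the resampled means to judge how precise the sample mean is. This is not code from the column. The function name `bootstrap_se`, the simulated data, and all parameter values are assumptions made purely for illustration.

```python
import random
import statistics


def bootstrap_se(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the sample mean by bootstrapping.

    Hypothetical helper for illustration: draws `n_resamples` samples of the
    same size as `sample`, with replacement, and returns the standard
    deviation of their means.
    """
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)  # sampling with replacement
        means.append(statistics.fmean(resample))
    return statistics.stdev(means)


# Simulated data (an assumption): 50 draws from a normal distribution.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

se_boot = bootstrap_se(sample)
# Textbook standard error of the mean, s / sqrt(n), for comparison.
se_formula = statistics.stdev(sample) / len(sample) ** 0.5

print(f"bootstrap SE: {se_boot:.3f}")
print(f"formula SE:   {se_formula:.3f}")
```

For the mean, the two estimates should land close to each other. The bootstrap becomes more useful for statistics such as the median, where no simple formula for the standard error is available.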