Statistics Series

Starting in 2013, Nature Methods began running a series of statistics columns called "Points of Significance." I paid little attention to them at the time (that year I cared far more about my Dota ladder rating). It was not until 2017, when I stumbled onto the series again, that my first reaction was: why would Nature Methods publish material like this? It felt a bit like finding a plate of scallion-tossed tofu on the menu of a Manchu-Han imperial feast. Only when I started reading the series carefully did I realize that this humble dish was rewriting my idea of good food: so many things I had dismissed as too simple to bother with turned out to be remarkably important. That is what prompted me to translate the series, in the hope that readers like me will find something worthwhile in it.

  • Importance of being uncertain - How samples are used to estimate population statistics and what this means in terms of uncertainty.

  • Error bars - The use of error bars to represent uncertainty and advice on how to interpret them.

  • Significance, P values and t-tests - Introduction to the concept of statistical significance and the one-sample t-test.

  • Power and sample size - Use of statistical power to optimize study design and sample numbers.

  • Visualizing samples with box plots - Introduction to box plots and their use to illustrate the spread and differences of samples.

  • Comparing samples—part I - How to use the two-sample t-test to compare either uncorrelated or correlated samples.

  • Comparing samples—part II - Adjustment and reinterpretation of P values when large numbers of tests are performed.

  • Nonparametric tests - Use of nonparametric tests to robustly compare skewed or ranked data.

  • Designing comparative experiments - The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.

  • Analysis of variance and blocking - Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.

  • Replication - Technical replication reveals technical variation while biological replication is required for biological inference.

  • Nested designs - Use the relative noise contribution of each layer in nested experimental designs to optimally allocate experimental resources using ANOVA.

  • Two-factor designs - It is common in biological systems for multiple experimental factors to produce interacting effects on a system. A study design that allows these interactions can increase sensitivity.

  • Sources of variation - To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication for replicable and meaningful results.

  • Split plot design - When some experimental factors are harder to vary than others, a split plot design can be efficient for exploring the main (average) effects and interactions of the factors.

  • Bayes’ theorem - Use Bayes’ theorem to combine prior knowledge with observations of a system and make predictions about it. (A small worked example follows this list.)

  • Bayesian statistics - Unlike classical frequentist statistics, Bayesian statistics allows direct inference of the probability that a model is correct and it provides the ability to update this probability as new data is collected.

  • Sampling distributions and the bootstrap - Use the bootstrap method to simulate new samples and assess the precision and bias of sample estimates. (A bootstrap sketch follows this list.)

  • Bayesian networks - Model interactions between causes and effects in large networks of causal influences using Bayesian networks, which combine network analysis with Bayesian statistics.

  • Association, correlation and causation - Pairwise dependencies can be characterized using correlation but be aware that correlation only implies association, not causation. Conversely, causation implies association, not correlation.

  • Simple linear regression - Given data on the relationship between two variables, linear regression is a simple and surprisingly robust method to predict unknown values. (A least-squares sketch follows this list.)

  • Multiple linear regression - When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.

  • Analyzing outliers: influential or nuisance - Some outliers influence the regression fit more than others.

  • Regression diagnostics - Residual plots can be used to validate assumptions about the regression model.

  • Logistic regression - Regression can be used on categorical responses to estimate probabilities and to classify.

  • Classification evaluation - It is important to understand both what a classification metric expresses and what it hides.

  • Model selection and overfitting - "With four parameters I can fit an elephant and with five I can make him wiggle his trunk". John von Neumann

  • Regularization - Constraining the magnitude of parameters of a model can control its complexity.

  • P values and the search for significance - "Little P value / What are you trying to say / Of significance?" - Steve Ziliak

  • Interpreting P values - A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis.

  • Tabular data - Tabulating the number of objects in categories of interest dates back to the earliest records of commerce and population censuses.

  • Clustering - Clustering finds patterns in data, whether they are there or not.

  • Principal component analysis - PCA helps you interpret your data, but it will not always find the important patterns.

  • Classification and regression trees - Decision trees are a simple but powerful prediction method.

  • Ensemble methods: bagging and random forests - Many heads are better than one.

  • Machine learning: a primer - Machine learning extracts patterns from data without explicit instructions.

  • Machine learning: supervised methods - Supervised learning algorithms extract general principles from observed examples guided by a specific prediction objective.
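
As a taste of the material the columns cover, here is a minimal Python sketch of Bayes' theorem applied to a diagnostic test, referenced from the "Bayes' theorem" entry above. The sensitivity, specificity, and prevalence figures are invented for illustration and are not taken from the column.

```python
# Bayes' theorem for a diagnostic test (all numbers are illustrative, not from the column).
sensitivity = 0.99   # P(positive | disease)
specificity = 0.95   # P(negative | no disease)
prevalence = 0.01    # P(disease), the prior

# Law of total probability: P(positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")  # about 0.167 despite the 99% sensitivity
```

Even with a highly sensitive test, a low prior (the 1% prevalence) keeps the posterior modest; that trade-off between prior knowledge and new observations is what the column is about.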
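
In the same spirit, the "Sampling distributions and the bootstrap" entry can be sketched with only the standard library: resample the data with replacement many times and look at the spread of the resampled means. The sample values, the number of resamples, and the percentile interval below are all arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical measurements (not taken from the column)
sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 5.8, 4.7]

def bootstrap_means(data, n_boot=10_000):
    """Draw n_boot resamples with replacement and return the mean of each."""
    n = len(data)
    return [statistics.mean(random.choices(data, k=n)) for _ in range(n_boot)]

boot = sorted(bootstrap_means(sample))

# Bootstrap standard error and a simple 95% percentile interval for the sample mean
se = statistics.stdev(boot)
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"mean = {statistics.mean(sample):.2f}, bootstrap SE = {se:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```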
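
Finally, a least-squares sketch for the "Simple linear regression" entry. The x and y values are made up; the point is only to show that the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means.

```python
# Least-squares fit of y = a + b*x for a toy dataset (values are illustrative only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2); intercept a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"fit: y = {a:.2f} + {b:.2f} * x")
print(f"predicted y at x = 7: {a + b * 7.0:.2f}")
```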
