Published in 2013, but still very interesting, and different from most data science books. Authors: Ian Langmore and Daniel Krasner. The book leans toward the statistics side of data science, while also getting readers started on basic programming and command-line skills. It does not, however, cover much of what you would expect from the machine learning side.
Source for picture: check page 68 in the book.
You can download the book here. For other related books, check out our recommended reading list.
Content
I Programming Prerequisites
1 Unix
- History and Culture
- The Shell
- Streams
- Standard streams
- Pipes
- Text
- Philosophy
- In a nutshell
- More nuts and bolts
- End Notes
2 Version Control with Git
- Background
- What is Git
- Setting Up
- Online Materials
- Basic Git Concepts
- Common Git Workflows
- Linear Move from Working to Remote
- Discarding changes in your working copy
- Erasing changes
- Remotes
- Merge conflicts
3 Building a Data Cleaning Pipeline with Python
- Simple Shell Scripts
- Template for a Python CLI Utility
II The Classic Regression Models
4 Notation
- Notation for Structured Data
5 Linear Regression
- Introduction
- Coefficient Estimation: Bayesian Formulation
- Generic setup
- Ideal Gaussian World
- Coefficient Estimation: Optimization Formulation
- The least squares problem and the singular value decomposition
- Overfitting examples
- L2 regularization
- Choosing the regularization parameter
- Numerical techniques
- Variable Scaling and Transformations
- Simple variable scaling
- Linear transformations of variables
- Nonlinear transformations and segmentation
- Error Metrics
- End Notes
6 Logistic Regression
- Formulation
- Presenter’s viewpoint
- Classical viewpoint
- Data generating viewpoint
- Determining the regression coefficient w
- Multinomial logistic regression
- Logistic regression for classification
- L1 regularization
- Numerical solution
- Gradient descent
- Newton’s method
- Solving the L1 regularized problem
- Common numerical issues
- Model evaluation
- End Notes
7 Models Behaving Well
- End Notes
III Text Data
8 Processing Text
- A Quick Introduction
- Regular Expressions
- Basic Concepts
- Unix Command line and regular expressions
- Finite State Automata and PCRE
- Backreference
- Python RE Module
- The Python NLTK Library
- The NLTK Corpus and Some Fun things to do
IV Classification
9 Classification
- Quick Introduction
- Naive Bayes
- Smoothing
- Measuring Accuracy
- Error metrics and ROC Curves
- Other classifiers
- Decision Trees
- Random Forest
- Out-of-bag classification
- Maximum Entropy
V Extras
10 High(er) performance Python
- Memory hierarchy
- Parallelism
- Practical performance in Python
- Profiling
- Standard Python rules of thumb
- For loops versus BLAS
- Multiprocessing Pools
- Multiprocessing example: Stream processing text files
- Numba
- Cython