Python与机器学习

以下几个库奠定了Python今天在机器学习领域的地位：NumPy、scikit-learn和TensorFlow。

Numpy

多维数组：ndarray
Universal function：ufunc
常用函数：routines
SciPy扩展
- 稀疏矩阵：scipy.sparse、coo_matrix、csc_matrix、csr_matrix
- 积分：scipy.integrate
- 插值：scipy.interpolate
- 最优化：scipy.optimize
- 统计：scipy.stats

scikit-learn

估计器（Estimator）：分类器模型，主要包含两个方法fit()和predict()。
转换器（Transformer）：数据预处理与数据转换，主要包含三个方法fit()、transform()、fit_transform()。
流水线（Pipeline）：流水线的输入是一连串的数据挖掘步骤，最后一步必须是估计器，前几步是转换器；每步用元组（‘名称’，步骤）来表示。
预处理（Preprocessing）
- 规范化：MinMaxScaler，最大最小值规范化；Normalizer，每条数据各特征的和为1；StandardScaler，使各特征均值为0，方差为1
- 编码：LabelEncoder，将字符串类型的数据转化为整型；OneHotEncoder，特征用一个二进制数字来表示；Binarizer，将数值特征二值化；MultiLabelBinarizer，将多标签二值化。
特征（Feature）
- 特征抽取：sklearn.feature_extraction
- 特征选择：sklearn.feature_selection
降维：sklearn.decomposition
组合：sklearn.ensemble
评估：sklearn.metrics，包括accuracy_score，confusion_matrix，classification_report，precision_recall_fscore_support等
交叉验证：sklearn.cross_validation
网格搜索：sklearn.grid_search

TensorFlow

TensorFlow是目前最流行的深度学习框架。我们先引用一段官网对于TensorFlow的介绍，来看一下Google对于它这个产品的定位。

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

上文并没有提到大红大紫的Deep Learning，而是聚焦在一个更广泛的科学计算应用领域。引文的关键词有：

Numerical Computation：应用领域是数值计算，所以TensorFlow不仅能支持Deep Learning，还支持其他机器学习算法，甚至包括更一般的数值计算任务（如求导、积分、变换等）。
Data Flow Graph：用graph来描述一个计算任务。
Node：代表一个数学运算（mathmatical operations，简称ops），这里面包括了深度学习模型经常需要使用的ops。
Edge：指向node的edge代表这个node的输入，从node引出来的edge代表这个node的输出，输入和输出都是multidimensional data arrays，即多维数组，在数学上又称之为tensor。这也是TensorFlow名字的由来，表示多维数组在graph中流动。
CPUs/GPUs：支持CPU和GPU两种设备，支持单机和分布式计算。

作者基于官方文档对TensorFlow进行了初步的学习，并对实践过程进行了记录：

Python与数据分析

数据框：pandas
日期时间处理：datetime
统计绘图
- matplotlib
- seaborn
- bokeh

Python与工具开发

环境管理：virtualenv
命令行工具开发：click
多进程、多线程并发：multiprocessing，multiprocessing.dummy
单元测试：unittest
日志：logging
服务主备：zookeeper
将C代码变为Python库

Python服务端开发

数据库：pymysql
ORM：SQLAlchemy
WSGI工具：werkzeug
HTML模板：jinja2
Web框架：flask
socket编程
自行实现WSGI

Python客户端开发

HTTP请求：requests
JSON解析：json
HTML解析：beautifulsoup
XML解析：ElementTree，lxml
爬虫：scrapy

其他

R与Python对比

左手python右手R