Quick start guide
Let's build a classifier for the classic iris dataset. If you don't have
RDatasets installed, add it with `Pkg.add("RDatasets")`.
```julia
using RDatasets: dataset
iris = dataset("datasets", "iris")
# ScikitLearn.jl expects arrays, but DataFrames can also be used - see
# the corresponding section of the manual
X = convert(Array, iris[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
y = convert(Array, iris[:Species])
```
Next, we load the LogisticRegression model from scikit-learn's library.
```julia
using ScikitLearn
# This model requires scikit-learn. See
# http://scikitlearnjl.readthedocs.io/en/latest/models/#installation
@sk_import linear_model: LogisticRegression
```
Every model's constructor accepts hyperparameters (such as regularization
strength, whether to fit the intercept, the penalty type, etc.) as
keyword arguments. Check out `?LogisticRegression` for details.
```julia
model = LogisticRegression(fit_intercept=true)
```
Then we train the model and evaluate its accuracy on the training set:
```julia
fit!(model, X, y)
accuracy = sum(predict(model, X) .== y) / length(y)
println("accuracy: $accuracy")
```
> accuracy: 0.96
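Training-set accuracy is an optimistic estimate, since the model has already seen those labels. A held-out split gives a more honest number; here is a minimal sketch, assuming `ScikitLearn.CrossValidation` exports a `train_test_split` that mirrors scikit-learn's:

```julia
using ScikitLearn.CrossValidation: train_test_split

# Hold out 20% of the rows for testing, train on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
fit!(model, X_train, y_train)
test_accuracy = sum(predict(model, X_test) .== y_test) / length(y_test)
println("test accuracy: $test_accuracy")
```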
Cross-validation
This will train five models, on five train/test splits of X and y, and return the test-set accuracy of each:
```julia
using ScikitLearn.CrossValidation: cross_val_score
cross_val_score(LogisticRegression(), X, y; cv=5) # 5-fold
```
> 5-element Array{Float64,1}:
> 1.0
> 0.966667
> 0.933333
> 0.9
> 1.0
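A single summary number is often more convenient than five fold scores; averaging them is the usual shortcut (on Julia 0.7+, `mean` lives in the `Statistics` standard library):

```julia
using Statistics: mean

scores = cross_val_score(LogisticRegression(), X, y; cv=5)
println("mean CV accuracy: $(mean(scores))")  # average of the five fold accuracies
```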
See this tutorial for more information.
Hyperparameter tuning
`LogisticRegression` has a regularization-strength parameter `C` (smaller is
stronger). We can use grid search algorithms to find the optimal `C`.
`GridSearchCV` will try all values of `C` in `0.1:0.1:2.0` and will
return the one with the highest cross-validation performance.
```julia
using ScikitLearn.GridSearch: GridSearchCV
gridsearch = GridSearchCV(LogisticRegression(), Dict(:C => 0.1:0.1:2.0))
fit!(gridsearch, X, y)
println("Best parameters: $(gridsearch.best_params_)")
```
> Best parameters: Dict{Symbol,Any}(:C=>1.1)
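Besides `best_params_`, the fitted `GridSearchCV` also exposes the winning score and a refit estimator; the attribute names below mirror scikit-learn's and are a sketch, not guaranteed by this guide:

```julia
# Mean cross-validation accuracy of the best parameter setting
println("Best score: $(gridsearch.best_score_)")

# best_estimator_ is the winning model, refit on all of X and y
y_pred = predict(gridsearch.best_estimator_, X)
```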
Finally, we plot cross-validation accuracy vs. `C`:
```julia
using PyPlot
plot([cv_res.parameters[:C] for cv_res in gridsearch.grid_scores_],
     [mean(cv_res.cv_validation_scores) for cv_res in gridsearch.grid_scores_])
```
Saving the model to disk
Both Python and Julia models can be saved to disk:
```julia
import JLD, PyCallJLD
JLD.save("my_model.jld", "model", model)
model = JLD.load("my_model.jld", "model") # Load it back
```
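One quick way to sanity-check the round trip is to compare predictions before and after reloading; a sketch, assuming the fitted `model` is still in scope:

```julia
JLD.save("my_model.jld", "model", model)
loaded = JLD.load("my_model.jld", "model")

# The reloaded model should make identical predictions
@assert predict(loaded, X) == predict(model, X)
```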