If the goal is to post a sample Jupyter notebook online as a portfolio demonstration piece, then one of the first “problems” one encounters is which dataset to use. If the goal of the portfolio piece is more to demonstrate model-building skills, and less to demonstrate domain knowledge of the dataset or feature-engineering skills, then starting with raw data would be tedious and distracting in a lean demonstration. One could, in theory, prepare the data in a separate analysis, host the prepared dataset on e.g. GitHub, and fetch it by URL in the portfolio piece, but, again, this is distracting if data munging and feature engineering are not the primary goals of the demo.
Fortunately, this is a common problem, and therefore, there exist standardized ways of fetching data.
Please see this jupyter notebook for some examples of good security-related datasets, as well as some examples of loading them.
The rest of this post documents a rabbit hole related to data sources for a machine learning assignment I am building with scikit-learn. The problems in this domain often have binomial targets, so logistic regression makes a natural starter algorithm.
There exists this scikit-learn example for using logistic regression with mixed datatypes – nominal and continuous. It shows a pipeline for handling such data; adapting it only really requires changing the data source and correctly flagging which features are numeric and which are categorical.
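The shape of that pipeline is roughly the following sketch: one sub-pipeline per feature type, joined by a ColumnTransformer, with logistic regression on the end. The column names here are hypothetical placeholders, not the ones from the scikit-learn example.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names -- substitute the ones from your dataset.
numeric_features = ["age", "fare"]
categorical_features = ["embarked", "sex"]

# Numeric columns: fill missing values, then standardize.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill missing values, then one-hot encode.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column list to its sub-pipeline.
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])
```

Once `clf` is built, the usual `clf.fit(X, y)` / `clf.predict(X)` calls apply the whole preprocessing chain automatically.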
The example imports the fetch_openml function from sklearn:
from sklearn.datasets import fetch_openml
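A minimal usage sketch, assuming network access (the dataset name below is just an example; any OpenML dataset name or numeric ID works the same way, and downloads are cached locally by default):

```python
from sklearn.datasets import fetch_openml


def load_openml_dataset(name="credit-g", version=1):
    """Fetch a dataset from OpenML by name and version.

    Requires network access on first call; subsequent calls hit the
    local cache. Pass as_frame=False to get numpy arrays instead of
    a pandas DataFrame.
    """
    X, y = fetch_openml(
        name=name,
        version=version,
        as_frame=True,
        return_X_y=True,
    )
    return X, y


if __name__ == "__main__":
    X, y = load_openml_dataset()
    print(X.dtypes)        # mixed numeric and categorical columns
    print(y.value_counts())
```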
Rabbit hole begins
OpenML boasts boatloads of datasets, “tasks”, “flows”, and “runs.” In theory, it provides a platform for data scientists to upload and score their analyses and predictions of and for various datasets. The uploaded “tasks”, “flows”, and “runs” are supposed to be accompanied by source code or READMEs for replication purposes, but from my brief overview, that functionality is a hot mess. The key-value pair fields for run uploads get abused by auto-uploaders, which leads to a jumble of function names, calls, and parameters. Clicking the “download” buttons in these various areas produced JSON with yet another wide variety of key-value pairs. Maybe I just don’t have the vision, but in my opinion, the datasets are the main thing of value on OpenML.
sklearn.pipeline.Pipeline(
    columntransformer=sklearn.compose._column_transformer.ColumnTransformer(
        numeric=sklearn.pipeline.Pipeline(
            missingindicator=sklearn.impute.MissingIndicator,
            imputer=sklearn.preprocessing.imputation.Imputer,
            standardscaler=sklearn.preprocessing.data.StandardScaler
        ),
        nominal=sklearn.pipeline.Pipeline(
            simpleimputer=sklearn.impute.SimpleImputer,
            onehotencoder=sklearn.preprocessing._encoders.OneHotEncoder
        )
    ),
    extratreesclassifier=sklearn.ensemble.forest.ExtraTreesClassifier
)
I did the formatting myself; the original was all in a single string. This was part of the documentation, which I guess is mildly helpful, after the ten minutes it took to (unreproducibly) find it and parse it.
The datasets on OpenML are in ARFF format. The format specifies data attributes at the top of the file, and the data at the bottom. Scikit-learn’s fetch_openml will, you guessed it, fetch and load an OpenML dataset into a numpy array.
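To make the format concrete, here is a tiny hand-written ARFF file parsed with scipy’s ARFF reader; the relation, attribute names, and values are all made up for illustration.

```python
import io

from scipy.io import arff

# A minimal hand-written ARFF file: @attribute declarations at the top,
# comma-separated rows under @data at the bottom.
sample = io.StringIO("""\
@relation demo
@attribute duration numeric
@attribute protocol {tcp,udp,icmp}
@attribute label {normal,attack}
@data
0.5,tcp,normal
1.2,udp,attack
""")

data, meta = arff.loadarff(sample)
print(meta.names())            # ['duration', 'protocol', 'label']
print(data["duration"].mean())
```

Note that loadarff returns a structured numpy array, and nominal values come back as bytes (e.g. `b'tcp'`), so string columns usually need decoding before use.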
There exists a New Standard (tm) called Frictionless Data, which seeks to become the de facto standard for formatting datasets. OpenML has said it may eventually support the Frictionless Data format, but it does not yet.
So there’s an internet “I made this” regurgitation cycle for you: many of the datasets on OpenML come from the UCI Machine Learning Repository, and datahub.io now hosts OpenML datasets.