# Getting Started

In this section, we'll run some "Hello world" `primrose` jobs. However, rather than print a string, we'll do some more useful tasks:

- run a read-write job (i.e., read data from a source and write it to some sink; that is, an `Extract-Transform-Load` job)
- run machine learning models, including clustering and classification

## Your first primrose job

Once you have `primrose` installed, you can run a job. First, you'll need to create your own project directory using the `primrose` command line:

```
primrose create-project --name <project_name>
```

This will create a directory with everything you need to get started, including example config files. Just navigate into this directory to begin.

### A read-write job

Open `config/hello_world_read_write.json`:

```
{
  "implementation_config": {
    "reader_config": {
      "read_data": {
        "class": "CsvReader",
        "filename": "data/tennis.csv",
        "destinations": [
          "write_output"
        ]
      }
    },
    "writer_config": {
      "write_output": {
        "class": "CsvWriter",
        "key": "data",
        "dir": "cache",
        "filename": "tennis_output.csv"
      }
    }
  }
}
```

There are two tasks:

- `read_data`: this runs an instance of `CsvReader`, which loads data from `data/tennis.csv` into a pandas dataframe and sends the data to the next task: `write_output`.
- `write_output`: this receives the data (associated with a key `data`) and uses an instance of `CsvWriter` to write the data out to `cache/tennis_output.csv`.

It might not be obvious that it is pandas under the hood, but otherwise this configuration file should be relatively intuitive.

### Running the job

To run the job, type

```
primrose run --config config/hello_world_read_write.json
```

You should see something similar to:

```
2019-07-23 21:15:09,075 INFO configuration.py __init__: Loading config file at config/hello_world_read_write.json
2019-07-23 21:15:09,081 INFO configuration.py check_sections: OK: all sections are supported operations
2019-07-23 21:15:09,081 INFO configuration.py check_config: OK: all class keys are present
2019-07-23 21:15:09,081 INFO configuration.py check_config: OK: all classes recognized
2019-07-23 21:15:09,081 INFO configuration.py check_config: OK: good necessary_configs
2019-07-23 21:15:09,082 INFO configuration_dag.py create_dag: Checking configuration DAG
2019-07-23 21:15:09,082 INFO configuration_dag.py create_dag: OK: good referential integrity
2019-07-23 21:15:09,082 INFO configuration_dag.py check_connected_components: OK: 1 connected component
2019-07-23 21:15:09,082 INFO configuration_dag.py check_for_cycles: OK: no cycles found
2019-07-23 21:15:09,082 INFO dag_runner.py __init__: Traverser is of class
2019-07-23 21:15:09,083 INFO dag_runner.py filter_sequence: Taking nodes to run from default
2019-07-23 21:15:09,083 INFO dag_runner.py run: received node read_data of type reader_config and class CsvReader
2019-07-23 21:15:09,083 INFO csv_reader.py run: Reading data/tennis.csv from CSV
2019-07-23 21:15:09,109 INFO dag_runner.py run: received node write_output of type writer_config and class CsvWriter
2019-07-23 21:15:09,109 INFO csv_writer.py run: Saving data data to cache/tennis_output.csv
2019-07-23 21:15:09,115 INFO dag_runner.py run: All done. Bye bye!
```

That's it. You have run a **primrose** job.

### Examining the output

The first half of this output is a set of validation checks against the configuration. You can read more about these checks in [DAG Configurations](README_DAG_CONFIG.md). The `Traverser` is the component that determines the order in which the `primrose` nodes are processed.
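To build some intuition for what a traverser computes, here is a minimal sketch that derives a run order from the `destinations` lists in the config above. This uses `networkx` (an extra dependency for this sketch) rather than primrose's internal traverser classes, so treat it as an illustration, not the library's implementation:

```python
# A sketch of what a traverser does: build a DAG from each node's
# "destinations" list, then take a topological sort as the run order.
import json

import networkx as nx

with open("config/hello_world_read_write.json") as f:
    config = json.load(f)["implementation_config"]

graph = nx.DiGraph()
for section in config.values():           # e.g. reader_config, writer_config
    for name, node in section.items():    # e.g. read_data, write_output
        graph.add_node(name)
        for destination in node.get("destinations", []):
            graph.add_edge(name, destination)

print(list(nx.topological_sort(graph)))   # ['read_data', 'write_output']
```

For this two-node config there is only one valid order: `read_data` must run before `write_output`, because the writer depends on the reader's data.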
Importantly, there are four output lines that show the actual work being done:

```
2019-07-23 21:15:09,083 INFO dag_runner.py run: received node read_data of type reader_config and class CsvReader
2019-07-23 21:15:09,083 INFO csv_reader.py run: Reading data/tennis.csv from CSV
2019-07-23 21:15:09,109 INFO dag_runner.py run: received node write_output of type writer_config and class CsvWriter
2019-07-23 21:15:09,109 INFO csv_writer.py run: Saving data data to cache/tennis_output.csv
```

You should now have a file `cache/tennis_output.csv` that matches the input file `data/tennis.csv`.

## A Machine Learning Job: Clustering

Now, let's do some machine learning. We'll run a basic K-means clustering and plot the clusters.

### Plotting the Graph

Before we examine the configuration file, let's plot a graph of the DAG by typing:

```
primrose plot --config config/hello_world_cluster_simple_train.json --outfile graph.png
```

This will write an image of the DAG to `graph.png`, without actually running the nodes.

![](img/hw_cluster.png)

Here we see a `read_data` node feeding into a `normalize_data` node, which in turn feeds into a `cluster_model` node. The output feeds to `write_data` and also to `cluster_plotter`.

### Examining the configuration

With that background, let's examine the configuration file. This is `config/hello_world_cluster_simple_train.json`:

```
{
  "metadata": {
    "explanation": [
      "this is a minimal Sklearn clustering example in training mode:",
      " * read in some data from CSV",
      " * sklearn.preprocessing.StandardScaler transform the data but no train test split!",
      " * cluster with sklearn.cluster.KMeans",
      " * predict on same training data, appending predictions column to data",
      " * save enriched data to CSV",
      " * plot clusters and save image to disk"
    ]
  },
  "implementation_config": {
    "reader_config": {
      "read_data": {
        "class": "CsvReader",
        "filename": "data/unclustered.csv",
        "destinations": [
          "normalize_data"
        ]
      }
    },
    "pipeline_config": {
      "normalize_data": {
        "class": "SklearnPreprocessingPipeline",
        "operations": [
          {
            "class": "preprocessing.StandardScaler",
            "columns": ["x1", "x2"],
            "args": {"with_mean": true, "with_std": true}
          }
        ],
        "is_training": true,
        "destinations": [
          "cluster_model"
        ]
      }
    },
    "model_config": {
      "cluster_model": {
        "class": "SklearnClusterModel",
        "mode": "train",
        "X": ["x1", "x2"],
        "model": {
          "class": "cluster.KMeans",
          "args": {"n_clusters": 6, "random_state": 42}
        },
        "destinations": [
          "write_data",
          "cluster_plotter"
        ]
      }
    },
    "writer_config": {
      "write_data": {
        "class": "CsvWriter",
        "key": "data",
        "dir": "cache",
        "filename": "clustered_output.csv"
      }
    },
    "dataviz_config": {
      "cluster_plotter": {
        "class": "ClusterPlotter",
        "id_col": "cluster_id",
        "filename": "clusters.png",
        "title": "Results of KMeans(k=6)"
      }
    }
  }
}
```

Most of it should be relatively intuitive, but let's walk through it.

- `read_data` uses a `CsvReader` to read in some data. This feeds to `normalize_data`.
- `normalize_data` is a pipeline, a preprocessing step. It runs `sklearn.preprocessing.StandardScaler` on the columns `x1` and `x2`, scaling them to zero mean and unit standard deviation. It is in training mode.
- This feeds into `cluster_model`, which uses `sklearn.cluster.KMeans` with k=6 (and seed=42) to cluster the data. That feeds to a `CsvWriter` and to a `ClusterPlotter`. The plotter creates a scatterplot titled "Results of KMeans(k=6)" and saves it to `clusters.png`.
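For orientation, here is roughly what those nodes amount to in plain scikit-learn. This is a sketch for intuition only; primrose wires the data between nodes for you, and the sketch assumes `data/unclustered.csv` has columns `x1` and `x2`, as the config implies:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# read_data: load the raw points
data = pd.read_csv("data/unclustered.csv")

# normalize_data: scale x1 and x2 to zero mean and unit standard deviation
X = StandardScaler(with_mean=True, with_std=True).fit_transform(data[["x1", "x2"]])

# cluster_model: fit KMeans with k=6, appending the cluster assignments
data["cluster_id"] = KMeans(n_clusters=6, random_state=42).fit_predict(X)

# write_data: save the enriched dataframe
data.to_csv("cache/clustered_output.csv", index=False)
```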
You can now run this primrose job with the following command:

```
primrose run --config config/hello_world_cluster_simple_train.json
```

Here is the output:

![](img/clusters.png)

Congratulations. You have now machine-learned in `primrose`!

## A Machine Learning Job with Model Caching: Classification

In this last example, we will train another Sklearn classifier model. However, here, we cache the trained model so that it can be used against a new dataset at a later stage (i.e., in a separate `primrose` job) in either `predict` or `eval` mode. The dataset we are using is Tom Mitchell's play-tennis example, commonly used to illustrate decision trees.

Here, we run two separate `primrose` jobs: 1) train and 2) predict. Each job has its own configuration file.

`config/hello_world_classifier_train.json` reads the `tennis` dataset, does a train-test split, runs a Sklearn decision tree model in mode `train`, and writes both the `TransformerSequence` (the set of transformations performed on the features) and the `model` to a local cache using `dill`:

```
...
"encode_and_split": {
  "class": "EncodeTrainTestSplit",
  "is_training": true,
  "target_variable": "play",
  "training_fraction": 0.65,
  "seed": 42,
  "destinations": [
    "decision_tree_model",
    "write_encoder"
  ]
}
...
"write_encoder": {
  "class": "DillWriter",
  "key": "transformer_sequence",
  "dir": "cache",
  "filename": "hello_world_encoder.dill"
},
"write_model": {
  "class": "DillWriter",
  "key": "model",
  "dir": "cache",
  "filename": "hello_world_model.dill"
}
```

A separate configuration file, `config/hello_world_classifier_predict.json`, is designed to make predictions using the cached model. Ideally, we would use a different dataset to make predictions on, but we are using the same tennis dataset here for simplicity. In this configuration, we read the data and the two cached objects:

```
"reader_config": {
  "read_data": {
    "class": "CsvReader",
    "filename": "data/tennis.csv",
    "destinations": [
      "encode_and_split"
    ]
  },
  "read_model": {
    "class": "DillReader",
    "filename": "cache/hello_world_model.dill",
    "destinations": [
      "decision_tree_model"
    ]
  },
  "read_encoder": {
    "class": "DillReader",
    "filename": "cache/hello_world_encoder.dill",
    "destinations": [
      "encode_and_split"
    ]
  }
},
```

run the data through the model with `mode: predict`:

```
"decision_tree_model": {
  "class": "SklearnClassifierModel",
  "sklearn_classifier_name": "tree.DecisionTreeClassifier",
  "mode": "predict",
  "grid_search_scoring": null,
  "cv_folds": null,
  "destinations": [
    "write_output"
  ],
  "model_parameters": {}
}
```

and, finally, write out the predictions:

```
"writer_config": {
  "write_output": {
    "class": "CsvWriter",
    "key": "predictions",
    "dir": "cache",
    "filename": "hello_world_predictions.csv"
  }
}
```

Try running it yourself. First, run

```
primrose run --config config/hello_world_classifier_train.json
```

and inspect the encoder and model files in the `cache` folder. Now run the predict job:

```
primrose run --config config/hello_world_classifier_predict.json
```

and find the predictions in the `cache` folder.

You are now set up to design realistic `primrose` runs in a production setting.
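If you are curious about what the train job actually cached, here is a minimal sketch for peeking at the two `dill` artifacts. It assumes the train job above has already been run; the exact structure of each stored object is determined by primrose's `DillWriter`, so we only inspect the types here:

```python
import dill

# Load the artifacts cached by the train job. What each object looks
# like depends on what DillWriter stored under the "model" and
# "transformer_sequence" keys.
with open("cache/hello_world_model.dill", "rb") as f:
    model = dill.load(f)
with open("cache/hello_world_encoder.dill", "rb") as f:
    transformer_sequence = dill.load(f)

print(type(model))
print(type(transformer_sequence))
```

## Next

Learn more about `primrose` configurations: [DAG Configurations](README_DAG_CONFIG.md)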