Run the Analytics Pipeline

The Analytics pipeline is the heart of Autonomous Identity. It analyzes, calculates, and determines the association rules, confidence scores, predictions, and recommendations for assigning entitlements to the users.

The analytics pipeline is an intensive processing operation that can take some time depending on your dataset and configuration. To ensure an accurate analysis, the data needs to be as complete as possible, with few or no null values. Once you have prepared the data, you must run a series of analytics jobs to ensure an accurate rendering of the entitlements and confidence scores.

The initial pipeline step is to create, edit, and apply the analytics_init_config.yml configuration file. The analytics_init_config.yml file configures the key properties for the analytics pipeline. In general, you will not need to change this file too much, except for the Spark configuration options. For more information, see Prepare Spark Environment.

Next, run a job to validate the data. When the data is acceptable, ingest it into the database. After that, run a final audit of the data to ensure accuracy. If everything passes, run the data through its initial training process to create the association rules for each user-assigned entitlement. This is a somewhat intensive operation because the analytics job generates a million or more association rules. Once the association rules have been determined, they are applied to user-assigned entitlements.

After the training run, run predictions to determine the current confidence scores for all assigned entitlements. After this, run a recommendations job that looks at all users who do not have a specific entitlement but should, based on their user attribute data. Once the predictions and recommendations are completed, run an insight report to get a summary of the analytics pipeline run, and an anomaly report that reports any anomalous entitlement assignments.

The final steps are to push the data to the backend Cassandra or MongoDB database, and then configure and apply any UI configuration changes to the system.

The general analytics process is outlined as follows:

Note

The analytics pipeline requires that DNS properly resolve the hostname before the pipeline starts. Make sure to set the hostname on your DNS server or locally in your /etc/hosts file.
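
The hostname check described in this note can be sketched as a small shell function. This is a minimal sketch, not part of Autonomous Identity: the function name `check_hostname_resolves` is hypothetical, and it assumes the `getent` utility is available on the target machine (it consults both DNS and /etc/hosts on typical Linux systems).

```shell
# Hypothetical helper: verify that a hostname resolves before starting the pipeline.
check_hostname_resolves() {
  # $1: hostname to check; defaults to this machine's hostname
  local host="${1:-$(hostname)}"
  if getent hosts "$host" >/dev/null 2>&1; then
    echo "OK: $host resolves"
  else
    echo "FAIL: $host does not resolve; add it to DNS or /etc/hosts" >&2
    return 1
  fi
}
```

Calling `check_hostname_resolves` with no argument checks the local hostname. A non-zero return status indicates that name resolution should be fixed before running any analytics job.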

Analytic Actions

The Deployer-based installation of the analytics services provides an "analytics" alias (alias analytics='docker exec -it analytics bash analytics') on the server, with which you can perform a number of actions for configuration or to run the pipeline on the target machine.

A Summary of the Analytics Services Commands

Command	Description
analytics create-template	Run this command to create the `analytics_init_config.yml` configuration file.
analytics apply-template	Apply the changes in the `analytics_init_config.yml` file and create the `analytics_config.yml` file.
analytics ingest	Ingest data into Autonomous Identity.
analytics audit	Run a data audit to ensure that the data meets the specifications.
analytics train	Run an analytics training job.
analytics predict-as-is	Run as-is predictions.
analytics predict-recommendation	Run recommendations.
analytics publish	Push the data to the Apache Cassandra/MongoDB backend.
analytics anomaly	Create the Anomaly report.
analytics insight	Create the Insights report.
analytics create-assignment-index	Generate the Elasticsearch index.
analytics create-ui-config	Create the `ui_config.json` file.
analytics apply-ui-config	Apply the `ui_config.json` file.
analytics run-pipeline	Run all of the pipeline commands at once in the following order: validate, ingest, audit, train, predict-as-is, predict-recommendation, publish, create-ui-config, apply-ui-config.
analytics upgrade	Run an analytics upgrade when updating from one Autonomous Identity version to the latest version. For more information, see Upgrade Autonomous Identity.

Create Initial Analytics Template

The main configuration file for the Analytics service is analytics_init_config.yml. You generate this file by running the analytics create-template command.

On the deployer node, SSH to the target node:
```
$ ssh autoid@<Target-IP-Address>
```
Create the initial configuration template. The command generates the analytics_init_config.yml in the /data/conf/ directory.
```
$ analytics create-template
```
You should see the following output if the job completed successfully:
```
analytics template config file created at CONF_DIR/analytic_init_config.yml. Please edit it and run apply-template
```
Edit the analytics_init_config.yml for any specific configurations for your deployment.
- For information on data preparation, see Data Preparation.
- For information on Spark tuning, see Prepare Spark Environment.
- For information on data ingestion, see Data Ingestion.
Copy the .csv files to the /data/input folder. If you are using the sample dataset, it is located in the /data/conf/demo-data directory.
```
$ cp *.csv /data/input/
```
Apply the template to the analytics service. The command generates the analytics_config.yml file in the /data/conf/ directory. Autonomous Identity uses this configuration for other analytic jobs.
Note
Do not edit the analytics_config.yml file directly. If you want to make any additional configuration changes, edit the analytics_init_config.yml file again, and then re-apply the changes using the analytics apply-template command.
```
$ analytics apply-template
```
You should see the following output if the job completed successfully:
```
analytics template config file created at CONF_DIR/analytic_init_config.yml. Please edit it and run apply-template
```
You can now either run the analytics pipeline jobs individually in a specific order, or run the full pipeline all at once.
If this is your first time running the pipeline, we recommend running each step individually in the order shown in the procedures that follow, starting with "Ingest the Data Files".
If you are familiar with the analytics pipeline process, run the full pipeline as presented in "Run Full Pipeline".

Ingest the Data Files

By this point, you should have prepared and validated the data files for ingestion into Autonomous Identity. This process imports the seven .csv files into the Cassandra or MongoDB database.
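
Before ingesting, a quick count of the .csv files in the input directory can catch a missing file early. The following is a minimal sketch only: the function name `count_csv_files` is hypothetical, and the expected count is passed in by the caller (the text above mentions seven files in /data/input).

```shell
# Hypothetical helper: confirm the input directory holds the expected number of .csv files.
count_csv_files() {
  # $1: input directory (the text uses /data/input)
  # $2: expected number of .csv files (the text mentions seven)
  local dir="$1" expected="$2" found
  found=$(find "$dir" -maxdepth 1 -type f -name '*.csv' | wc -l | tr -d ' ')
  if [ "$found" -eq "$expected" ]; then
    echo "OK: found $found .csv files in $dir"
  else
    echo "WARN: expected $expected .csv files in $dir, found $found" >&2
    return 1
  fi
}
```

For example, `count_csv_files /data/input 7` returns a non-zero status if any file is missing, so it can gate the ingest step in a script.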

Ingest the data into the Cassandra or MongoDB database:

Make sure Cassandra or MongoDB is up and running.
Make sure you have determined your Spark configuration in terms of the number of executors and memory.
Run the data ingestion command.
```
$ analytics ingest
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics/autoid-analytics/ai_ingest.py is successful
```

Run Data Audit

Before running the analytics training run, we need to do one final audit of the data. The audit runs through the seven .csv files as loaded into the database and generates initial metrics for your company.

Run the Data Audit:

Verify that the .csv files are in the /data/input/ directory.
Run the audit command.
```
$ analytics audit
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics/autoid-analytics/ai_test.py is successful
```
You can access the audit report (audit_report.txt) in the /data/input/spark_runs/reports directory on the target server.

The script provides the following metrics:

CSV File Audit

File	Description
features.csv	Number of columns/categories (number of features). Number of rows (number of users). Number of unique values in each column. Number of null values in each column. 10 most common values for specific attributes in each column.
labels.csv	Number of columns/categories (number of labels). Number of rows (number of entitlement assignments). Number of unique values and category names. Number of unique values in the high risk column. Number of entitlements with single mappings (that is, only one user has been assigned this entitlement), 2, 3-5, 6-10, 11-100, 100-1000, 1000+ mappings. Number of multiple numbers of mappings. Entitlement and count of assigned users. Assigned user and count.
apptoent.csv	Number of columns/categories. Number of rows. Number of unique entitlements. Number of unique applications. Number of users in entitlement assignments. Application name and associated number of entitlements with the application.
roleowner.csv	Number of columns/categories. Number of rows. Number of unique entitlements. Number of unique role owners. Top 10 role owners with most assigned entitlements.
Other Insights	Number of user keys in the labels file but not in the features file. Number of manager keys in features that do not exist as users.

Run Training

Now that you have ingested the data into Autonomous Identity, start the training run.

Training involves two steps. The first step is an initial machine learning run in which Autonomous Identity analyzes the data and produces the association rules. In a typical deployment, this run can generate several million rules. Each rule maps user attributes to an entitlement and is assigned a confidence score. In the second step, the rules are applied to the user-assigned entitlements.

The initial training run may take time as it goes through the analysis process. Once it completes, it saves the results directly to the Cassandra database.

Start the training process:

Run the training command.
```
$ analytics train
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics//autoid-analytics/ai_train.py is successful
```
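Because a training run can take a long time, it may help to capture the job output in a log file and check it for the success message shown above. The following is a minimal sketch under stated assumptions: the function name `run_and_check` is hypothetical, and it relies only on the "is successful" text from the sample output, not on any documented exit status.

```shell
# Hypothetical helper: run a job, keep its output in a log, and look for the success message.
run_and_check() {
  # $1: log file; remaining arguments: the job command to run
  local log="$1"; shift
  "$@" 2>&1 | tee "$log"
  # The "is successful" text is taken from the sample output in this guide.
  if grep -q "is successful" "$log"; then
    echo "job reported success"
  else
    echo "no success message found; inspect $log" >&2
    return 1
  fi
}
```

Shell aliases are normally not expanded inside scripts, so a script would spell out the command behind the analytics alias, for example `run_and_check /tmp/train.log docker exec analytics bash analytics train`. Dropping the `-it` flags for non-interactive use is an assumption that you should verify in your environment.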

Run Predictions and Recommendations

After your initial training run, the association rules are saved to disk. The next phase is to use these rules as a basis for the predictions module.

The predictions module consists of two processes:

as-is. During the as-is prediction process, confidence scores are assigned to the entitlements that users currently have. During a pre-processing phase, labels.csv and features.csv are combined in a way that appends them only to the access rights that each user has. The as-is process maps the highest confidence score to the highest freqUnion rule for each user-entitlement access. These rules are then displayed in the UI and saved directly to the Cassandra database.
recommendation. During the recommendations process, confidence scores are assigned to all entitlements. This allows Autonomous Identity to recommend entitlements to users who do not have them. The lowest confidence entitlement is bounded by the confidence threshold used in the initial training step. During a pre-processing phase, labels.csv and features.csv are combined in a way that appends them to all access rights. The script analyzes each employee who may not have a particular entitlement and predicts the access rights that they should have according to their high confidence score justifications. These rules are then displayed in the UI and saved directly to the Cassandra database.

Run as-is Predictions:

In most cases, there is no need to make any changes to the configuration file. However, if you want to modify the analytics, make changes to your analytics_init_config.yml file.
For example, check that you have set the correct parameters for the association rule analysis (for example, minimum confidence score) and for deciding the rules for each employee (for example, the confidence window range over which to consider rules equivalent).
Run the as-is predictions command.
```
$ analytics predict-as-is
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics/autoid-analytics/ai_predict_asis.py is successful
```

Run Recommendations:

Make any changes to the configuration file, analytics_init_config.yml, to ensure that you have set the correct parameters (for example, minimum confidence score).

Run the recommendations command.
```
$ analytics predict-recommendation
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics/autoid-analytics/ai_predict_recommend.py is successful
```

Publish the Analytics Data

Populate the output of the training, predictions, and recommendation runs to a large table with all assignments and justifications for each assignment. The table data is then pushed to the Cassandra or MongoDB backend.

Publish the data to the backend:

```
$ analytics publish
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics/autoid-analytics/ai_load.py is successful
```

Run Anomaly Report

Autonomous Identity provides a report on any anomalous entitlement assignments that have a low confidence score but are for entitlements that have a high average confidence score. The report's purpose is to identify true anomalies rather than poorly managed entitlements. The script writes the anomaly report to a Cassandra or MongoDB database. The report is written to a report_anomaly table in the autoid_analytics keyspace in Cassandra and MongoDB.

The report generates the following points:

Identifies potential anomalous assignments.
Identifies the number of users who fall below a low confidence score threshold. For example, if 100 people all have low confidence score assignments to the same entitlement, then it is unlikely to be an anomaly. Either the entitlement is missing data or the assignment is poorly managed.
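
The reasoning behind the report can be illustrated with a small awk sketch. This is not the product's implementation: the function name `flag_anomalies`, the input layout (`user,entitlement,score` with no header), and the two thresholds are all hypothetical. It flags an assignment only when its own score is low and the average score for the same entitlement is high, so an entitlement where nearly everyone scores low is not reported.

```shell
# Hypothetical illustration of the anomaly idea; not the Autonomous Identity algorithm.
flag_anomalies() {
  # $1: CSV file with lines of user,entitlement,score (assumed layout)
  # $2: low score threshold; $3: high average threshold (both illustrative)
  awk -F, -v low="$2" -v high="$3" '
    # First pass over the file: accumulate per-entitlement totals.
    NR == FNR { sum[$2] += $3; n[$2]++; next }
    # Second pass: low individual score on an entitlement with a high average.
    $3 < low && sum[$2] / n[$2] >= high { print $1 "," $2 "," $3 }
  ' "$1" "$1"
}
```

The file is passed to awk twice so that the averages are known before any assignment is evaluated.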

Run the anomaly report:

```
$ analytics anomaly
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics/autoid-analytics/ai_report_anomaly.py is successful
```

Run the Insight Report

Next, run an insight report on the rules and predictions generated during the training and predictions runs. The command generates insight_report.txt and insight_report.xlsx and writes them to the /data/input/spark_runs/reports directory.

The report provides the following insights:

Number of assignments received, scored, and unscored.
Number of entitlements received, scored, and unscored.
Number of assignments scored >80% and <5%.
Distribution of assignment confidence scores.
List of the high volume, high average confidence entitlements.
List of the high volume, low average confidence entitlements.
Top 25 users with more than 10 entitlements.
Top 25 users with more than 10 entitlements and confidence scores greater than 80%.
Top 25 users with more than 10 entitlements and confidence scores less than 5%.
Breakdown of all applications and confidence scores of their assignments.
Supervisors with most employees and confidence scores of their assignments.
Top 50 role owners by number of assignments.
List of the "Golden Rules", high confidence justifications that apply to a large volume of people.
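
As a rough illustration of one of these metrics, the count of assignments scored above 80% and below 5% could be computed as follows. This is a sketch only: the function name `score_buckets` and the input layout (`user,entitlement,score`, with scores between 0 and 1) are assumptions, not the format the insight job actually reads.

```shell
# Hypothetical illustration of the ">80% and <5%" metric; input layout is assumed.
score_buckets() {
  # $1: CSV file with lines of user,entitlement,score (scores in the range 0..1)
  awk -F, '
    $3 > 0.80 { high++ }
    $3 < 0.05 { low++ }
    END { printf "total=%d high=%d low=%d\n", NR, high, low }
  ' "$1"
}
```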

Run the Insight Report:

```
$ analytics insight
```
You should see the following output if the job completed successfully:
```
Script : /home/analytics//autoid-analytics/ai_report_insight.py is successful
```

Create Assignment Index

Next, generate the Elasticsearch index using the analytics create-assignment-index command.

Create the index:

```
$ analytics create-assignment-index
```
You should see the following output if the job completed successfully:
```
Script : CreateElasticIndex is successful
```

Create the Analytics UI Config File

Once the analytics pipeline has completed, you can configure the UI using the analytics create-ui-config command if desired.

Run the analytics create-ui-config command to generate the ui_config.json file in the /data/conf/ directory. The file sets what is displayed in the Autonomous Identity UI.
```
$ analytics create-ui-config
```
You should see the following output if the job completed successfully:
```
Script : init.py is successful
```
In most cases, you can run the file as-is. If you want to make changes, make edits to the ui_config.json file and save it to the /data/conf/ directory.
Apply the file.
```
$ analytics apply-ui-config
```
You should see the following output if the job completed successfully:
```
Script : init.py is successful
```
If every pipeline process has ended successfully, you have successfully run the full analytics pipeline.

Run Full Pipeline

You can run the full analytics pipeline with a single command using the run-pipeline command. Make sure your data is in the correct directory, /data/input, and that any UI configuration changes are set in the ui_config.json file in the /data/conf/ directory.

The run-pipeline command runs the following jobs in order:

analytics ingest
analytics audit
analytics train
analytics predict-as-is
analytics predict-recommendation
analytics publish
analytics anomaly
analytics insight
analytics create-assignment-index
analytics create-ui-config
analytics apply-ui-config
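
If you prefer to run the same jobs one at a time from a script and stop at the first failure, the sequence above can be sketched as a loop. This is a minimal sketch, not a replacement for run-pipeline: the function name `run_jobs_in_order` is hypothetical, and it assumes each job returns a non-zero exit status on failure, which this guide does not confirm.

```shell
# Hypothetical helper: run the listed jobs in order, stopping at the first failure.
run_jobs_in_order() {
  # "$@": the command prefix that invokes a job; the job name is appended to it
  local job
  for job in ingest audit train predict-as-is predict-recommendation publish \
             anomaly insight create-assignment-index create-ui-config apply-ui-config; do
    echo "== analytics $job"
    # Assumption: a failed job exits with a non-zero status.
    "$@" "$job" || { echo "analytics $job failed; stopping" >&2; return 1; }
  done
}
```

Because aliases are normally not expanded in scripts, pass the command behind the analytics alias explicitly, for example `run_jobs_in_order docker exec analytics bash analytics`. Omitting the `-it` flags here is an assumption for non-interactive use.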

Run the full pipeline:

```
$ analytics run-pipeline
```
You should see the following output if the job completed successfully:
```
Script : init.py is successful

Pipe Line Ends
```