Run the Analytics Pipeline
The analytics pipeline is the heart of Autonomous Identity. It analyzes the data and determines the association rules, confidence scores, predictions, and recommendations for assigning entitlements to users.
The analytics pipeline is an intensive processing operation that can take some time, depending on your dataset and configuration. To ensure an accurate analysis, the data must be as complete as possible, with few or no null values. Once you have prepared the data, run a series of analytics jobs to produce an accurate set of entitlements and confidence scores.
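As a quick sanity check before ingestion, you can count empty fields in a data file. The following one-liner is a minimal sketch that assumes a simple comma-separated layout with no quoted commas; the file name is only an example:
$ awk -F',' '{ for (i = 1; i <= NF; i++) if ($i == "") n++ } END { print n+0, "empty fields" }' /data/input/features.csv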
The initial pipeline step is to create, edit, and apply the analytics_init_config.yml configuration file. The analytics_init_config.yml file configures the key properties for the analytics pipeline. In general, you do not need to change much in this file other than the Spark configuration options. For more information, see Prepare Spark Environment.
Next, run a job to validate the data and, once it is acceptable, ingest it into the database. After that, run a final audit of the data to ensure accuracy. If everything passes, run the data through its initial training process to create the association rules for each user-assigned entitlement. This is a fairly intensive operation, as the analytics can generate a million or more association rules. Once the association rules have been determined, they are applied to user-assigned entitlements.
After the training run completes, run predictions to determine the current confidence scores for all assigned entitlements. Then run a recommendations job that evaluates all users who do not have a specific entitlement but should, based on their user attribute data. Once the predictions and recommendations are complete, run an insight report to get a summary of the analytics pipeline run, and an anomaly report that flags any anomalous entitlement assignments.
The final steps are to push the data to the backend Cassandra or MongoDB database, and then configure and apply any UI configuration changes to the system.
The general analytics process is outlined as follows:
Note
The analytics pipeline requires that DNS properly resolve the target hostname before it starts. Make sure the hostname is set on your DNS server or locally in your /etc/hosts file.
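For example, to add a local mapping on the target server and then verify that it resolves (the IP address and hostname below are placeholders; substitute your own values):
$ echo "192.0.2.10  autoid.example.com autoid" | sudo tee -a /etc/hosts
$ getent hosts autoid.example.com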
Analytic Actions
The Deployer-based installation of the analytics service provides an "analytics" alias (alias analytics='docker exec -it analytics bash analytics') on the server. Use this alias to perform configuration actions or to run the pipeline on the target machine.
Command | Description |
---|---|
analytics create-template | Create the analytics_init_config.yml configuration file template. |
analytics apply-template | Apply the changes made to the analytics_init_config.yml file and generate the analytics_config.yml file. |
analytics ingest | Ingest data into Autonomous Identity. |
analytics audit | Run a data audit to ensure it meets the specifications. |
analytics train | Run an analytics training run. |
analytics predict-as-is | Run as-is predictions. |
analytics predict-recommendation | Run recommendations. |
analytics publish | Push the data to the Apache Cassandra or MongoDB backend. |
analytics anomaly | Create the Anomaly report. |
analytics insight | Create the Insights report. |
analytics create-assignment-index | Generate the Elasticsearch index. |
analytics create-ui-config | Create the ui_config.json UI configuration file. |
analytics apply-ui-config | Apply the ui_config.json file. |
analytics run-pipeline | Run all of the pipeline commands in sequence. For the exact order, see Run Full Pipeline. |
analytics upgrade | Run an analytics upgrade when updating from one Autonomous Identity version to the latest version. For more information, see Upgrade Autonomous Identity. |
Create Initial Analytics Template
The main configuration file for the Analytics service is analytics_init_config.yml. You generate this file by running the analytics create-template command.
On the deployer node, SSH to the target node:
$ ssh autoid@<Target-IP-Address>
Create the initial configuration template. The command generates the analytics_init_config.yml file in the /data/conf/ directory.
$ analytics create-template
You should see the following output if the job completed successfully:
analytics template config file created at CONF_DIR/analytic_init_config.yml. Please edit it and run apply-template
Edit the analytics_init_config.yml file for any configurations specific to your deployment.
For information on data preparation, see Data Preparation.
For information on Spark tuning, see Prepare Spark Environment.
For information on data ingestion, see Data Ingestion.
Copy the .csv files to the /data/input folder. Note that if you are using the sample dataset, it is located in the /data/conf/demo-data directory.
$ cp *.csv /data/input/
Apply the template to the analytics service. The command generates the analytics_config.yml file in the /data/conf/ directory. Autonomous Identity uses this configuration for other analytics jobs.
Note
Do not edit the analytics_config.yml file directly. If you want to make any additional configuration changes, edit the analytics_init_config.yml file again, and then re-apply the new changes using the analytics apply-template command.
$ analytics apply-template
You should see the following output if the job completed successfully:
analytics template config file created at CONF_DIR/analytic_init_config.yml. Please edit it and run apply-template
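Before moving on, you can optionally confirm that both generated configuration files are present in the /data/conf/ directory, a quick sanity check using the file names described above:
$ ls -l /data/conf/analytics_init_config.yml /data/conf/analytics_config.yml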
You now have the option to run the analytics pipeline steps individually in a specific order, or to run the full pipeline all at once.
If this is your first time running the pipeline, we recommend running each step individually in the order shown in the procedure, starting with "Ingest the Data Files".
If you are familiar with the analytics pipeline process, run the full pipeline as presented in "Run Full Pipeline".
Ingest the Data Files
By this point, you should have prepared and validated the data files for ingestion into Autonomous Identity. This process imports the seven .csv files into the Cassandra or MongoDB database.
Ingest the data into the Cassandra or MongoDB database:
Make sure Cassandra or MongoDB is up and running (see the example check after this procedure).
Make sure you have determined your Spark configuration in terms of the number of executors and memory.
Run the data ingestion command.
$ analytics ingest
You should see the following output if the job completed successfully:
Script : /home/analytics/autoid-analytics/ai_ingest.py is successful
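How you verify that the database is up depends on your deployment. As a minimal sketch, assuming the standard client tools are available on the database host, you might run one of the following (use mongo instead of mongosh on older MongoDB installations):
$ nodetool status
$ mongosh --eval "db.runCommand({ ping: 1 })"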
Run Data Audit
Before running the analytics training run, do one final audit of the data. The audit runs through the seven .csv files as loaded into the database and generates initial metrics for your company.
Run the Data Audit:
Verify that the .csv files are in the /data/input/ directory.
Run the audit command.
$ analytics audit
You should see the following output if the job completed successfully:
Script : /home/analytics/autoid-analytics/ai_test.py is successful
You can access the audit report (audit_report.txt) in the /data/input/spark_runs/reports directory on the target server.
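For example, to view the report from the target server:
$ less /data/input/spark_runs/reports/audit_report.txt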
The script provides metrics for each of the following inputs: features.csv, labels.csv, apptoent.csv, and roleowner.csv, along with other insights.
Run Training
Now that you have ingested the data into Autonomous Identity, start the training run.
Training involves two steps: the first step is an initial machine learning run in which Autonomous Identity analyzes the data and produces the association rules. In a typical deployment, this run can generate several million rules. The second step maps each rule from user attributes to entitlements and assigns it a confidence score.
The initial training run may take time as it goes through the analysis process. Once it completes, it saves the results directly to the Cassandra or MongoDB database.
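As a rough point of reference only, since Autonomous Identity's exact scoring model is not described here, the confidence of an association rule in the usual sense is the fraction of users matching the rule's attributes who also hold the entitlement. For example, if 100 users share a given set of attribute values and 90 of them hold entitlement E, a rule mapping those attributes to E would have a confidence of 90/100 = 0.90 (90%).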
Start the training process:
Run the training command.
$ analytics train
You should see the following output if the job completed successfully:
Script : /home/analytics//autoid-analytics/ai_train.py is successful
Run Predictions and Recommendations
After your initial training run, the association rules are saved to disk. The next phase is to use these rules as a basis for the predictions module.
The predictions module consists of two processes:
as-is. During the As-Is Prediction process, confidence scores are assigned to the entitlements that users already have. During a pre-processing phase, labels.csv and features.csv are combined in a way that appends them to only the access rights that each user has. The as-is process maps the highest confidence score to the highest freqUnion rule for each user-entitlement access. These rules are then displayed in the UI and saved directly to the Cassandra or MongoDB database.
recommendation. During the Recommendations process, confidence scores are assigned to all entitlements. This allows Autonomous Identity to recommend entitlements to users who do not have them. The lowest confidence entitlement is bound by the confidence threshold used in the initial training step. During a pre-processing phase, labels.csv and features.csv are combined in a way that appends them to all access rights. The script analyzes each employee who may not have a particular entitlement and predicts the access rights that they should have according to their high confidence score justifications. These rules are then displayed in the UI and saved directly to the Cassandra or MongoDB database.
Run as-is Predictions:
In most cases, there is no need to make any changes to the configuration file. However, if you want to modify the analytics, make changes to your analytics_init_config.yml file. For example, check that you have set the correct parameters for the association rule analysis (for example, the minimum confidence score) and for deciding the rules for each employee (for example, the confidence window range over which to consider rules equivalent).
Run the as-is predictions command.
$ analytics predict-as-is
You should see the following output if the job completed successfully:
Script : /home/analytics/autoid-analytics/ai_predict_asis.py is successful
Run Recommendations:
Make any changes to the configuration file, analytics_init_config.yml, to ensure that you have set the correct parameters (for example, the minimum confidence score).
Run the recommendations command.
$ analytics predict-recommendation
You should see the following output if the job completed successfully:
Script : /home/analytics/autoid-analytics/ai_predict_recommend.py is successful
Publish the Analytics Data
The publish step populates the output of the training, predictions, and recommendations runs into a large table with all assignments and the justifications for each assignment. The table data is then pushed to the Cassandra or MongoDB backend.
Publish the data to the backend:
$ analytics publish
You should see the following output if the job completed successfully:
Script : /home/analytics/autoid-analytics/ai_load.py is successful
Run Anomaly Report
Autonomous Identity provides a report on any anomalous entitlement assignments that have a low confidence score but belong to entitlements with a high average confidence score. The report's purpose is to identify true anomalies rather than poorly managed entitlements. The script writes the anomaly report to the report_anomaly table in the autoid_analytics keyspace in Cassandra or MongoDB.
The report covers the following points:
Identifies potential anomalous assignments.
Identifies the number of users who fall below a low confidence score threshold. For example, if 100 people all have low confidence score assignments to the same entitlement, it is unlikely to be an anomaly; the entitlement is either missing data or the assignment is poorly managed.
Run the anomaly report:
$ analytics anomaly
You should see the following output if the job completed successfully:
Script : /home/analytics/autoid-analytics/ai_report_anomaly.py is successful
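If your backend is Cassandra and cqlsh is available and able to connect to your cluster, you can spot-check the generated data in the table noted above (keyspace and table names as given earlier in this section):
$ cqlsh -e "SELECT * FROM autoid_analytics.report_anomaly LIMIT 10;"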
Run the Insight Report
Next, run an insight report on the rules and predictions generated during the training and predictions runs. The analytics command generates insight_report.txt and insight_report.xlsx and writes them to the /data/input/spark_runs/reports directory.
The report provides the following insights:
Number of assignments received, scored, and unscored.
Number of entitlements received, scored, and unscored.
Number of assignments scored >80% and <5%.
Distribution of assignment confidence scores.
List of the high volume, high average confidence entitlements.
List of the high volume, low average confidence entitlements.
Top 25 users with more than 10 entitlements.
Top 25 users with more than 10 entitlements and confidence scores greater than 80%.
Top 25 users with more than 10 entitlements and confidence scores less than 5%.
Breakdown of all applications and confidence scores of their assignments.
Supervisors with the most employees and the confidence scores of their assignments.
Top 50 role owners by number of assignments.
List of the "Golden Rules", high confidence justifications that apply to a large volume of people.
Run the Insight Report:
$ analytics insight
You should see the following output if the job completed successfully:
Script : /home/analytics//autoid-analytics/ai_report_insight.py is successful
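When the job finishes, the two report files should be present in the reports directory noted above:
$ ls -l /data/input/spark_runs/reports/insight_report.*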
Create Assignment Index
Next, generate the Elasticsearch index using the analytics create-assignment-index command.
Create the index:
$ analytics create-assignment-index
You should see the following output if the job completed successfully:
Script : CreateElasticIndex is successful
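Optionally, you can confirm that the index was created by listing the Elasticsearch indices. This is a minimal sketch that assumes Elasticsearch is reachable on localhost:9200 without TLS or authentication; adjust the URL and credentials for your deployment:
$ curl 'http://localhost:9200/_cat/indices?v'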
Create the Analytics UI Config File
Once the analytics pipeline has completed, you can configure the UI using the analytics create-ui-config command if desired.
Run the analytics create-ui-config command to generate the ui_config.json file in the /data/conf/ directory. The file sets what is displayed in the Autonomous Identity UI.
$ analytics create-ui-config
You should see the following output if the job completed successfully:
Script : init.py is successful
In most cases, you can use the file as-is. If you want to make changes, edit the ui_config.json file and save it to the /data/conf/ directory (see the example check after this procedure).
Apply the file.
$ analytics apply-ui-config
You should see the following output if the job completed successfully:
Script : init.py is successful
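If you edit ui_config.json by hand, a quick way to catch JSON syntax errors is to parse the file before applying it. This sketch assumes Python 3 is available on the target node; jq works equally well if installed:
$ python3 -m json.tool /data/conf/ui_config.json > /dev/null && echo "ui_config.json parses cleanly"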
If every pipeline job has completed successfully, you have run the full analytics pipeline.
Run Full Pipeline
You can run the full analytics pipeline with a single command using the run-pipeline command. Make sure your data is in the correct directory, /data/input, and that any UI configuration changes are set in the ui_config.json file in the /data/conf/ directory.
The run-pipeline command runs the following jobs in order:
analytics ingest
analytics audit
analytics train
analytics predict-as-is
analytics predict-recommendation
analytics publish
analytics anomaly
analytics insight
analytics create-assignment-index
analytics create-ui-config
analytics apply-ui-config
Run the full pipeline:
$ analytics run-pipeline
You should see the following output if the job completed successfully:
Script : init.py is successful
Pipe Line Ends
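Because the full pipeline can run for a long time, consider starting it inside a persistent terminal session so that an SSH disconnect does not interrupt it. A minimal sketch, assuming tmux is installed on the target server:
$ tmux new -s autoid-pipeline
$ analytics run-pipeline
Detach from the session with Ctrl+b d, and reattach later with tmux attach -t autoid-pipeline.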