Data Preparation

Once you have deployed Autonomous Identity, you can prepare your dataset in a format that meets the Autonomous Identity schema.

The initial step is to obtain the data as agreed upon between ForgeRock and your company. The files contain a subset of user attributes from the HR database and entitlement metadata required for the analysis. Only the attributes necessary for analysis are used.

Clients can transfer the data to ForgeRock on portable media, such as a USB drive, or through a connector from the client systems. The analysts then review the data to ensure that it is properly formatted.

Several steps must be carried out before your production entitlement data is input into Autonomous Identity. These steps are summarized below:

Data Collection

Typically, the raw client data is not in a form that meets the Autonomous Identity schema. For example, a unique user identifier can have multiple names, such as user_id, account_id, user_key, or key. Similarly, entitlement columns can have several names, such as access_point, privilege_name, or entitlement.
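The naming variations above can be handled with a simple alias table. The following is a minimal sketch, not part of the ForgeRock tooling: the COLUMN_ALIASES table and the normalize_header function are hypothetical names, and the alias list covers only the examples mentioned in this section.

```python
# Hypothetical alias table: raw client column names mapped to the
# names used by the Autonomous Identity schema (USR_KEY, ENT).
COLUMN_ALIASES = {
    "user_id": "USR_KEY",
    "account_id": "USR_KEY",
    "user_key": "USR_KEY",
    "key": "USR_KEY",
    "access_point": "ENT",
    "privilege_name": "ENT",
    "entitlement": "ENT",
}


def normalize_header(header):
    """Return the header with known aliases replaced by schema names.

    Unknown columns are kept as-is (with surrounding whitespace removed).
    Raises ValueError if two raw columns map to the same schema name,
    because that case needs a human decision.
    """
    normalized = [
        COLUMN_ALIASES.get(name.strip().lower(), name.strip())
        for name in header
    ]
    duplicates = {name for name in normalized if normalized.count(name) > 1}
    if duplicates:
        raise ValueError(
            f"Several columns map to the same name: {sorted(duplicates)}"
        )
    return normalized
```

For example, normalize_header(["account_id", "privilege_name", "city"]) keeps the unknown city column and renames the other two to USR_KEY and ENT.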

To get the correct format, here are some general rules:

Submit the raw client data in a file format such as .csv, .xlsx, or .txt. The data can be in a single file or in multiple files. Data includes user attributes, entitlement descriptions, and entitlement assignments.
Remove duplicate values.
Add optional columns for additional training attributes, for example, MANAGERS_MANAGER and MANAGER_FLAG.
Merge user attribute information and entitlement metadata into the entitlement assignments. This creates one large dataframe that should have an individual row for each assignment. Each row should contain the relevant user attribute profile information and entitlement metadata for the assignment.
Rename any columns that Autonomous Identity uses to the appropriate names, for example, employeeid to USR_KEY, entitlement_name to ENT.
Build the seven dataframes needed for Autonomous Identity, such as features, labels, and HRName. This step may also include adding some additional columns to each dataframe, for example, labels['IS_ASSIGNED'] = 'Y'.
Write out the seven dataframes to seven .csv files.
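The rules above can be sketched in a few lines of Python. This is an illustration only, not the ForgeRock transformation script: lists of dictionaries stand in for dataframes so that the sketch runs with the standard library alone, the helper names (dedupe, rename, merge_assignments, to_csv) are hypothetical, and the sample rows are invented. Only the labels output is built here; the other six files would follow the same pattern.

```python
import csv
import io


def dedupe(rows):
    """Drop exact duplicate rows while preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def rename(rows, mapping):
    """Rename columns according to mapping; other columns are untouched."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def merge_assignments(assignments, users, entitlements):
    """Build one row per assignment, enriched with user and entitlement data."""
    users_by_key = {u["USR_KEY"]: u for u in users}
    ents_by_key = {e["ENT"]: e for e in entitlements}
    merged = []
    for assignment in assignments:
        row = dict(assignment)
        row.update(users_by_key.get(assignment["USR_KEY"], {}))
        row.update(ents_by_key.get(assignment["ENT"], {}))
        merged.append(row)
    return merged


def to_csv(rows, columns):
    """Serialize the selected columns to CSV text (header line included)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample data using the raw column names from the rules above.
RENAMES = {"employeeid": "USR_KEY", "entitlement_name": "ENT"}
users = rename([{"employeeid": "u1", "DEPARTMENT": "Finance"}], RENAMES)
entitlements = rename([{"entitlement_name": "e1", "APP_NAME": "Payroll"}], RENAMES)
assignments = dedupe(rename([
    {"employeeid": "u1", "entitlement_name": "e1"},
    {"employeeid": "u1", "entitlement_name": "e1"},  # duplicate, removed
], RENAMES))

merged = merge_assignments(assignments, users, entitlements)

# Equivalent of labels['IS_ASSIGNED'] = 'Y' on a dataframe.
labels = [
    {"USR_KEY": r["USR_KEY"], "ENT": r["ENT"], "IS_ASSIGNED": "Y"}
    for r in merged
]
labels_csv = to_csv(labels, ["USR_KEY", "ENT", "IS_ASSIGNED"])
```

To write a file instead of a string, pass a file opened with open(path, "w", newline="") to csv.DictWriter in place of the in-memory buffer.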

CSV Files and Schema

ForgeRock provides a transformation script that takes in raw data and converts it to .csv files in the accepted format.

You can access a Python script template to transform your client files into the correct .csv files. Run the following steps:

On the target machine, go to the /data/conf/ directory.
Open a text editor, and view the zoran_client_transformation.py template. You can edit this script for your company's dataset.

The script outputs seven files with the following contents:

CSV Files Outputs

Files	Description
features.csv	Contains one row for each employee with all of their user attributes.
labels.csv	Contains the user-to-entitlement mappings. Also includes usage data, if provided.
HRName.csv	Maps user IDs to their names. This file is needed for the UI.
EntName.csv	Maps entitlement IDs to their names. This file is needed for the UI.
RoleOwner.csv	Maps entitlement IDs to the employees who "own" these entitlements, that is, the people responsible for approving or revoking access to the entitlement.
JobAndDeptDesc.csv	Maps user IDs to the department in which they work, and also includes a description of their job within the company.
AppToEnt.csv	Maps entitlements to the applications they belong to. This file is needed for the UI.

The schema for the input files is as follows:

CSV Files Schema

Files	Schema
features.csv	The columns in this file depend on the attributes that the client wants to include. These are some required columns: USR_KEY. Specifies the user's unique ID. USR_DISPLAY_NAME. Specifies the user's name. If not provided, use the user ID, but the name works best for the UI. USR_MANAGER_KEY. Specifies the ID of the user's manager. USR_EMP_TYPE. Specifies the employment status of the user, for example, PERMANENT, CONTRACT, EMPLOYEE, NON-EMPLOYEE, or VENDOR. IS_ACTIVE. Specifies whether this user is an active employee. Sometimes companies submit inactive accounts, which are not included in the analysis.
labels.csv	USR_KEY. Specifies the user's unique ID. ENT. Specifies the unique entitlement identifier. HIGH_RISK. Specifies whether an access is considered HIGH, MEDIUM, or LOW risk. IS_ASSIGNED. Specifies whether an access is assigned (used internally). LAST_USAGE. Specifies the last time an entitlement was accessed.
HRName.csv	USR_KEY. Specifies the user's unique ID. USR_NAME. Specifies a human-readable username. For example, `John Smith`.
AppToEnt.csv	ENT. Specifies the unique entitlement identifier. APP_NAME. Specifies a human-readable application name. APP_ID. Specifies the unique application identifier.
RoleOwner.csv	ROLE. Specifies the unique user ID of the entitlement owner. ENT. Specifies the unique entitlement identifier.
JobAndDeptDesc.csv	USR_KEY. Specifies the user's unique ID. DEPARTMENT. Specifies the human-readable department. JOB_DESCRIPTION. Specifies the human-readable job description.
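Before loading the files, it can help to check each header against the schema table above. The following is a minimal sketch under stated assumptions: REQUIRED_COLUMNS and missing_columns are hypothetical names, every column listed in the table is treated as mandatory (which may be stricter than necessary, for example for LAST_USAGE, since usage data is included only if provided), and EntName.csv is omitted because its columns are not listed in the table.

```python
import csv
import io

# Columns taken from the schema table; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_COLUMNS = {
    "features.csv": [
        "USR_KEY", "USR_DISPLAY_NAME", "USR_MANAGER_KEY",
        "USR_EMP_TYPE", "IS_ACTIVE",
    ],
    "labels.csv": ["USR_KEY", "ENT", "HIGH_RISK", "IS_ASSIGNED", "LAST_USAGE"],
    "HRName.csv": ["USR_KEY", "USR_NAME"],
    "AppToEnt.csv": ["ENT", "APP_NAME", "APP_ID"],
    "RoleOwner.csv": ["ROLE", "ENT"],
    "JobAndDeptDesc.csv": ["USR_KEY", "DEPARTMENT", "JOB_DESCRIPTION"],
}


def missing_columns(file_name, csv_text):
    """Return the schema columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [c for c in REQUIRED_COLUMNS[file_name] if c not in present]
```

An empty list means the header contains every listed column; extra client-specific columns, such as additional attributes in features.csv, are not reported.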