Introducing an AEP Data Dictionary

Barry Mann
5 min read · Oct 4, 2024


Inspired by the Analytics Health Dashboard.

With over 12 years of Adobe Analytics implementations behind me, the very first thing I do when taking over a new account is run the Analytics Health Dashboard (AHD), found at adobe.ly/aahealth.

The Adobe Analytics Health Dashboard (AHD) is an Excel workbook that helps monitor and maintain the quality of your Adobe Analytics implementation. It connects to the management and reporting APIs.

It provides:

  1. Anomaly detection for variables and events.
  2. Tracking of all analytics components (events, eVars, props).
  3. Essential governance monitoring (I once had a client with 'paperfree', 'paper-free', 'PaperFree' and 'Paperfree' as values for billing type!).
  4. Insights into variable usage and performance.

This dashboard is essential for ensuring data accuracy, quickly identifying problems, and optimising your analytics setup, and it is regularly updated to improve its functionality.

Both v2 and v3 exist; to be honest, I'm more of a v2 kind of guy.

These days I am heavily involved in RT-CDP and Web SDK implementations, and I thought: wouldn't it be useful if the same thing existed for AEP?

Using some rudimentary university-level Python, I assembled a prototype and placed it on my GitHub.

Luma example

Download an .xls of the data dictionary here!

Naturally, I can't show you client data, so here is a demo based on the Luma dataset, which I imported. Credit to Saurabh Mahapatra for literally writing the book on this: https://data-distiller.all-stuff-data.com/pre-107-ingesting-json-test-data-into-adobe-experience-platform.

AEP Data Dictionary Based on the test Luma dataset

For each schema item, it's just Value–Instance pairs, reverse-sorted:

Luma Test Data emailid.id
productListItems.name

... and so on for everything that has a column in the dataset, with the corresponding dot notation.
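For instance, here is a minimal sketch of how nested JSON collapses into those dot-notation column names (the helper name and sample record are mine, not part of the script or platform):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-notation keys,
    e.g. {'emailid': {'id': 'x'}} -> {'emailid.id': 'x'}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

record = {"emailid": {"id": "jdoe@example.com"},
          "productListItems": {"name": "Yoga Mat"}}
print(flatten(record))
# {'emailid.id': 'jdoe@example.com', 'productListItems.name': 'Yoga Mat'}
```

Each flattened key becomes one column in the dictionary; each distinct value under that key becomes a row.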

It's also obtainable from the Query interface:

SELECT
  CASE
    WHEN commerce.checkouts.`value` = 1 THEN 'checkouts'
    WHEN commerce.productViews.`value` = 1 THEN 'productViews'
    WHEN commerce.productListAdds.`value` = 1 THEN 'productListAdds'
    ELSE 'unknown' -- catch cases where none of the conditions match
  END AS commerce_event_type,
  COUNT(*) AS event_count
FROM
  luma_web_details
GROUP BY
  commerce_event_type
ORDER BY
  event_count DESC; -- sort by event count from high to low

Which gives:

Query Engine showing uniques count for eventType

How it works

I said it was rudimentary …

This script:

  1. Logging setup: it logs to both the console and a log file to track the script's activity.
  2. JSON file reading: the script reads a JSON file (located at Z:\Luma Test\json_input\1.json). If successful, it processes each line of the file, extracting and appending valid JSON objects to a list.
  3. Data flattening and frequency counting: it flattens the JSON structure and counts the frequency of each key-value pair. The data is stored in a dictionary that uses a Counter to track the occurrences of each value.
  4. CSV writing: the frequency counts are written to a CSV file, where each key from the JSON becomes a column and the values are sorted by their frequency in descending order.
  5. Main execution: it processes the JSON file and saves the output to a CSV file named luma-dictionary-<timestamp>.csv in the Z:\Luma Test\out directory.
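The five steps above could be sketched roughly like this. This is my reconstruction, not the actual script from the repo, and the function names and output format are assumptions:

```python
import csv
import json
import logging
from collections import Counter, defaultdict
from datetime import datetime
from itertools import zip_longest

# Step 1: log to both the console and a file
logging.basicConfig(level=logging.INFO,
                    handlers=[logging.StreamHandler(),
                              logging.FileHandler("dictionary.log")])

def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-notation keys (emailid.id, etc.)."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def build_dictionary(json_path, csv_path):
    # Step 2: read the export line by line (one JSON object per line)
    counts = defaultdict(Counter)
    with open(json_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                logging.warning("Skipping malformed line")
                continue
            # Step 3: count occurrences of each value per flattened key
            for key, value in flatten(record).items():
                counts[key][str(value)] += 1
    # Step 4: one column per key, values reverse-sorted by frequency
    columns = sorted(counts)
    cells = [[f"{v} ({n})" for v, n in counts[c].most_common()]
             for c in columns]
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        writer.writerows(zip_longest(*cells, fillvalue=""))
    logging.info("Wrote %s", csv_path)

# Step 5: timestamped output name (paths are placeholders):
# stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
# build_dictionary(r"Z:\Luma Test\json_input\1.json",
#                  rf"Z:\Luma Test\out\luma-dictionary-{stamp}.csv")
```

The `zip_longest` keeps columns aligned even when one key has far more distinct values than another.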

When you get it working, fill the input folder with two weeks of JSON files exported from the platform.

Connections > Destinations
  1. Export your dataset as JSON (dump it into a Data Landing Zone or FTP) and copy it to your local machine.
  2. Place all the files in a folder on a PC of decent specification.
  3. Run the script (it may take some time; start with one one-hour file and build up to 14 days of data, subject to processing power).
  4. Inspect and format the resulting .csv.
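Before committing to a full 14-day run, it can help to sanity-check the export folder, since a single malformed file can waste a long processing pass. A minimal sketch (the function name and folder layout are my assumptions):

```python
import json
from pathlib import Path

def audit_exports(folder):
    """Report (valid, malformed) line counts per JSON-lines export file,
    to gauge volume and spot bad files before the full dictionary run."""
    report = {}
    for path in sorted(Path(folder).glob("*.json")):
        good = bad = 0
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    json.loads(line)
                    good += 1
                except json.JSONDecodeError:
                    bad += 1
        report[path.name] = (good, bad)
    return report
```

Run it over the input folder and scale up only once every file reports zero malformed lines.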

Thus:

FTP location containing platform exports.

Ensure you export as JSON, as Parquet files are difficult to work with.

It’s data governance gold:

I can see everything that is wrong with the data!

eventType is inconsistent: cartAddition contains leading whitespace

There is whitespace at the start of the eventType, so when marketing folks without data collection knowledge attempt to build an audience:

empty audience due to incorrect eventType
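A quick sketch of how you might scan the dictionary's values for this class of problem, i.e. variants that differ only by whitespace or case (the function name is mine; the sample values echo the examples above):

```python
from collections import defaultdict

def find_dirty_values(values):
    """Group values that collapse to the same token after stripping
    whitespace and lowercasing, e.g. ' cartAddition' vs 'cartAddition'."""
    groups = defaultdict(set)
    for v in values:
        groups[v.strip().lower()].add(v)
    return {canon: sorted(variants)
            for canon, variants in groups.items() if len(variants) > 1}

event_types = ["cartAddition", " cartAddition", "productView",
               "PaperFree", "paperfree"]
print(find_dirty_values(event_types))
# {'cartaddition': [' cartAddition', 'cartAddition'],
#  'paperfree': ['PaperFree', 'paperfree']}
```

Anything this flags is a value that will silently split or empty an audience in the builder.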

The Data Dictionary is a Rosetta Stone for marketing folks

One of the problems when working with the audience builder in AEP is that clustered values rarely work:

Here we easily build an audience based on a marketing campaign:

Will it work for *any* dataset, not just web data?

Yes. Since it just processes JSON files, I have successfully created dictionaries from the profile store. Same process.

Futurescope

It solves a lot of problems for us and is very useful.

It could theoretically be run in Jupyter inside the platform, subject to computing limits. We do not have Data Distiller, so we hit RAM and CPU limits in there quite often, and frankly, Jupyter in the platform is a little unstable.

It needs to cope with additions to the schema without re-coding. Our production data dictionary is over 300 columns wide.

API: it could (and really ought to) be run via the API, first building an index of all items' dot-notation paths, then pulling and clustering the values for each one.


Barry Mann

London & Vietnam based Digital Implementation Analyst since 2009. Founder of the London Web Analytics Contractor Group & one-time Web Analytics Wednesday (London).