Welcome to Ataccama’s guide to Reference Data Management, where we will outline what this component of Master Data Management (MDM) is in a way that is useful and informative for data analysts, governance stewards, and other data aficionados.
What exactly is reference data?
Reference data is data that defines the values used to classify and characterize other data. It is sometimes referred to, not always correctly, as Master Data, Golden Record, Golden Copy, or Single Source of Truth.
Reference Data Management (RDM) is an essential component of Master Data Management (MDM) that has evolved to become a mature data management discipline in its own right.
Think of the codes and descriptions that make up reference data as the way businesses identify information about their activities, such as financial transactions, locations, units of measurement, and inventories.
What are some examples of reference data?
In theory, reference data could apply to just about anything. Think about ordering a brand new Jaguar because you’ve been so successful in your data career. You’ll be able to customize the type of engine, wheels, seats, warming mechanisms, audio options, and more. Somewhere in their system, Jaguar will have reference data tables that correspond to these options that will be in use when you order your car.
For example, here is a table that contains information about different available engines:
Code | Description | Main fuel type | Engine category |
D165-AWD-A-MHEV | Ingenium 2.0 liter 4-cylinder 163PS Turbocharged Diesel MHEV (Automatic) | Diesel | MHEV |
D200-AWD-A-MHEV | Ingenium 2.0 liter 4-cylinder 204PS Turbocharged Diesel MHEV (Automatic) | Diesel | MHEV |
D300-AWD-A-MHEV | Ingenium 3.0 liter 6-cylinder 300PS Turbocharged Diesel MHEV (Automatic) | Diesel | MHEV |
P250-AWD-A | Ingenium 2.0 liter 4-cylinder 250PS Turbocharged Petrol (Automatic) | Petrol | PV |
P400e-AWD-A-PHEV | Ingenium 2.0 liter 4-cylinder 404PS Turbocharged Petrol PHEV (Automatic) | Petrol | PHEV |
Attributes such as Main fuel type and Engine category are examples of reference data. They help to categorize and describe the main data, or master data. In this specific case, this is product master data.
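As a minimal sketch of how such a reference table classifies other data, the snippet below joins a handful of orders to the engine codes from the table above. The order records, the model names, and the function name are invented for illustration; only the engine codes and their attributes come from the table.

```python
from collections import Counter

# Reference data: engine codes and attributes taken from the table above.
ENGINES = {
    "D165-AWD-A-MHEV": {"fuel": "Diesel", "category": "MHEV"},
    "D200-AWD-A-MHEV": {"fuel": "Diesel", "category": "MHEV"},
    "D300-AWD-A-MHEV": {"fuel": "Diesel", "category": "MHEV"},
    "P250-AWD-A": {"fuel": "Petrol", "category": "PV"},
    "P400e-AWD-A-PHEV": {"fuel": "Petrol", "category": "PHEV"},
}

# Hypothetical order records, invented for illustration only.
ORDERS = [
    {"order_id": 1, "model": "model-a", "engine_code": "D200-AWD-A-MHEV"},
    {"order_id": 2, "model": "model-b", "engine_code": "D165-AWD-A-MHEV"},
    {"order_id": 3, "model": "model-a", "engine_code": "P250-AWD-A"},
    {"order_id": 4, "model": "model-c", "engine_code": "P400e-AWD-A-PHEV"},
]


def orders_by_engine_category(orders, engines):
    """Count orders per engine category, regardless of car model."""
    return Counter(engines[order["engine_code"]]["category"] for order in orders)
```

Calling `orders_by_engine_category(ORDERS, ENGINES)` groups the orders by the Engine category attribute. Because the grouping goes through the reference table, the same query works no matter which car model an order belongs to.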
Here is an example of the ICD-10 codeset for classifying diseases:
Code | Description | Category | Sub-category |
A00 | Cholera | Certain infectious and parasitic diseases | Intestinal infectious diseases |
A01 | Typhoid and paratyphoid fevers | Certain infectious and parasitic diseases | Intestinal infectious diseases |
A02 | Other salmonella infections | Certain infectious and parasitic diseases | Intestinal infectious diseases |
A03 | Shigellosis | Certain infectious and parasitic diseases | Intestinal infectious diseases |
A04 | Other bacterial intestinal infections | Certain infectious and parasitic diseases | Intestinal infectious diseases |
The list continues to row Z98.89.
These codes are also grouped into taxonomies and hierarchies, for easier consumption by users.
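One way to picture such a hierarchy is as a nested mapping from category to sub-category to codes. The sketch below builds it from the five rows shown above; the function name and the nested-dictionary layout are assumptions made for illustration, not a description of any particular RDM product.

```python
from collections import defaultdict

# The five ICD-10 rows from the table above:
# (code, description, category, sub-category).
ICD10_ROWS = [
    ("A00", "Cholera",
     "Certain infectious and parasitic diseases", "Intestinal infectious diseases"),
    ("A01", "Typhoid and paratyphoid fevers",
     "Certain infectious and parasitic diseases", "Intestinal infectious diseases"),
    ("A02", "Other salmonella infections",
     "Certain infectious and parasitic diseases", "Intestinal infectious diseases"),
    ("A03", "Shigellosis",
     "Certain infectious and parasitic diseases", "Intestinal infectious diseases"),
    ("A04", "Other bacterial intestinal infections",
     "Certain infectious and parasitic diseases", "Intestinal infectious diseases"),
]


def build_hierarchy(rows):
    """Group codes as category -> sub-category -> list of codes."""
    tree = defaultdict(lambda: defaultdict(list))
    for code, _description, category, sub_category in rows:
        tree[category][sub_category].append(code)
    return tree
```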
Given its role, reference data is typically a static dataset: once defined, it governs the use of data in an organization until it is deliberately changed.
How crucial is RDM?
As technologies evolve and mature, both the volume of data and the opportunities to use it effectively keep growing. That means data increasingly has an operational impact on organizations: the day-to-day running of businesses, charities, educational institutions, and other establishments would be extremely difficult, if not impossible, without it.
With the increasing volumes of transactional data, the importance of reference data is growing too. Without it, we would have difficulty organizing and sharing common information about millions of products and services. We would even have difficulty determining the correct locations for delivery of those products and services.
Disorganized, incorrect, or inaccessible reference data can disrupt and delay business processes. Assigning the wrong code to a particular transaction can negatively impact billing, inventory counts and replenishment, or undermine the accuracy and destination of shipments.
Therefore, better management of data and its corresponding reference data can present enormous business value in the form of:
- More accurate analytics (e.g., which type of engine is sold the most regardless of the specific car model?)
- Reliable regulatory reporting
- Effective sharing of datasets between organizations (mutual understanding thanks to using industry-standard reference data codes)
Ultimately, the above help organizations serve their customers better, increase operating margin, and make their employees more productive and happier.
We will discuss more benefits later on in this article.
What are some common reference data problems?
Like any other data, if ungoverned and unmanaged, reference data becomes a liability, rather than an asset.
Here are some examples of what actually happens in organizations:
- Lack of a single source of truth for reference data, which results in:
  - Departments managing reference data in silos and synchronizing them slowly and irregularly
  - Using older versions
- Lack of data governance principles, such as:
  - Rules for using external reference data
  - The 4-eye principle for approving changes, i.e., a business approval workflow
- Using unsuitable tools, typically Excel (covered in a separate section below)
What this means for data stewards:
- Manually collecting requirements and reconciling differences throughout departments
- Manually updating reference data in consuming systems
- Involving IT and maintaining SQL queries or other technical solutions
What this means for the business and operations:
- Fines in highly regulated industries such as pharmaceuticals or airlines
- Inconsistent financial reporting
- Lack of alignment between different lines of business
- Operational mistakes
Here is an example of what can happen when two different sets of codes represent the same concept:
Code used in dept 1 | Description used in dept 1 | Code used in dept 2 | Description used in dept 2 |
DTS | Dental services | OD2 | Dental care |
SK | Skin care | DRM | Dermatology |
VIS | Vision specialist | OPM | Ophthalmologist |
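A common remedy is a crosswalk that maps every departmental code to one agreed code. The sketch below is a rough illustration only: it assumes, arbitrarily, that department 1's codes are chosen as the canonical ones, and the names `CROSSWALK` and `to_canonical` are hypothetical.

```python
# Crosswalk from (department, local code) to a canonical code.
# Assumption for illustration: department 1's codes are the canonical ones.
CROSSWALK = {
    ("dept1", "DTS"): "DTS",
    ("dept2", "OD2"): "DTS",
    ("dept1", "SK"): "SK",
    ("dept2", "DRM"): "SK",
    ("dept1", "VIS"): "VIS",
    ("dept2", "OPM"): "VIS",
}


def to_canonical(department, code):
    """Translate a department-specific code into the canonical code."""
    try:
        return CROSSWALK[(department, code)]
    except KeyError:
        # Fail loudly rather than letting an unmapped code slip into reports.
        raise ValueError(f"no mapping for code {code!r} from {department}") from None
```

With the crosswalk in place, `to_canonical("dept2", "OD2")` and `to_canonical("dept1", "DTS")` resolve to the same code, so records from both departments can be reported together.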
Why you shouldn’t manage reference data in Excel
The typical reason data users run into problems with reference data is that it has not been managed with a dedicated tool. Instead, it has been built and maintained in programs such as Excel, which are not fit for purpose.
This is why reference data should never be managed in spreadsheets.
Spreadsheets are not flexible or feature-rich enough to fully manage any type of data once the volumes increase or relationships between dictionaries arise.
- They provide very limited search functionality.
- They offer very limited ways to validate data or proactively alert users to change.
- They lack automated versioning capabilities and can exist across independent silos.
- Manually merging spreadsheet updates can take hours, and simple tasks such as adding a column to a codebook can affect how it is exported and shared with other systems and users.
- They lack the advanced approval workflow capabilities that are necessary to update critical reference data.
Learn more about why managing reference data centrally is the only way to do it and why Excel is not the right tool to do that in this webinar.
How does RDM (reference data management) work?
RDM provides a governance architecture to centralize and manage any reference data.
The data itself is stored in a single RDM hub or database repository because most data stewards, architects, and reference data managers believe an RDM solution should be the single and trusted source for the creation and company-wide distribution of reference data.
Reference data modeling
Reference data is modeled, or structured, by domain.
The model may be based on best practices or heavily influenced by external industry standards. HR, manufacturing, finance, and other departments will require different codeset models and approaches.
Reference data imports and mapping
Given the stability of codesets, importing reference data from outside the hub should not be a frequent occurrence. New codesets will typically be devised and modeled within the system as part of a deliberative process. Pre-existing codeset data may be copied and modified.
With more complex solution architectures involving legacy systems, centralized reference data authoring might not be possible. In this case, reference data is authored and synchronized between several systems, including the RDM solution itself, which holds the single version of the truth.
Data quality
Data quality is important for reference data just as it is for any other data. Therefore, it is important to embed data validations into the reference data authoring workflows. This will prevent mistakes from happening. Additionally, RDM solutions are able to maintain the referential integrity of data automatically.
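To make the idea of embedded validations concrete, here is a minimal sketch of checks that could run before a new engine code is accepted. The rules themselves (a required, unique code, plus fuel type and engine category restricted to known values) and all names are assumptions chosen for illustration; real RDM tools define such rules in their own way.

```python
# Allowed values, taken from the engine table earlier in this guide.
FUEL_TYPES = {"Diesel", "Petrol"}
ENGINE_CATEGORIES = {"MHEV", "PHEV", "PV"}


def validate_engine_row(row, existing_codes):
    """Return a list of validation errors; an empty list means the row is valid."""
    errors = []
    code = (row.get("code") or "").strip()
    if not code:
        errors.append("code is required")
    elif code in existing_codes:
        errors.append(f"duplicate code: {code}")
    # Referential integrity: attributes must point at known reference values.
    if row.get("fuel") not in FUEL_TYPES:
        errors.append(f"unknown fuel type: {row.get('fuel')!r}")
    if row.get("category") not in ENGINE_CATEGORIES:
        errors.append(f"unknown engine category: {row.get('category')!r}")
    return errors
```

Returning every error at once, rather than stopping at the first, lets a data steward correct a proposed row in a single pass.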
Workflows and approval
Since reference data has such an organization-wide impact, any changes to it are usually subject to a tightly controlled workflow process. As part of RDM governance, a designated team of data stewards and subject matter experts participates in this process to enforce the 4-eye principle before any changes are published. They collaboratively view, comment on, create, update, or even delete codesets. Upon approval, the codesets are published for viewing and distributed across all relevant business systems.
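The 4-eye principle can be sketched as a small state machine in which the person who approves a change must differ from the person who authored it. The class, its statuses (`draft`, `submitted`, `approved`), and the method names below are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRequest:
    """A proposed change to a codeset (hypothetical model for illustration)."""
    author: str
    description: str
    status: str = "draft"
    approved_by: Optional[str] = None

    def submit(self):
        if self.status != "draft":
            raise ValueError(f"cannot submit a change in status {self.status!r}")
        self.status = "submitted"

    def approve(self, reviewer):
        if self.status != "submitted":
            raise ValueError(f"cannot approve a change in status {self.status!r}")
        # 4-eye principle: the author may not approve their own change.
        if reviewer == self.author:
            raise PermissionError("author and approver must be different people")
        self.status = "approved"
        self.approved_by = reviewer
```

In this sketch, publishing to consuming systems would happen only once the status reaches `approved`.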
Versioning
Anticipating required updates to reference data should be a priority, as policy changes, new products and parts, location changes, or other new procedures will necessitate ongoing management.
RDM addresses this requirement through versioning. In the example below, the first row, which holds the current (active) part specification QIX-102-A, is set to be automatically ‘retired’ on March 13, 2015. Upon retirement, RDM will simultaneously publish the replacement part specification, QIX-201-C.
ID | Part Specification | Valid from | Valid to |
33 | QIX-102-A | 2000-01-01 | 2015-03-13 |
34 | QIX-201-C | 2015-03-13 | 2099-01-01 |
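A consuming system can resolve the active version for any given date from these validity ranges. The sketch below assumes that "Valid to" is exclusive, so that on 2015-03-13 only the replacement specification is active; that interpretation and the function name are assumptions, since the table itself does not say how the shared boundary date is handled.

```python
from datetime import date

# The two versions from the table above.
PART_SPEC_VERSIONS = [
    {"id": 33, "spec": "QIX-102-A",
     "valid_from": date(2000, 1, 1), "valid_to": date(2015, 3, 13)},
    {"id": 34, "spec": "QIX-201-C",
     "valid_from": date(2015, 3, 13), "valid_to": date(2099, 1, 1)},
]


def active_spec(versions, on):
    """Return the part specification that is active on the given date."""
    # Assumption: valid_from is inclusive and valid_to is exclusive.
    matches = [v for v in versions if v["valid_from"] <= on < v["valid_to"]]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one active version on {on}, found {len(matches)}")
    return matches[0]["spec"]
```

Treating the ranges as half-open means the two rows never overlap, so each date resolves to exactly one specification.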
Sharing trusted reference data
In addition to promoting a centralized governance approach, RDM’s data management mission includes sharing data with all relevant business systems and users.
There are multiple ways governed reference data can be leveraged and made available throughout an organization:
- System/application integration: RDM can integrate with business systems to publish data directly to hard drives, the cloud, or FTP servers used by HR, finance, sales, marketing, and more.
- Data Catalogs: Reference data is a major offering now provided by modern data catalogs. Data catalogs can become a trusted source for reference data, enabling users to search and request code sets for business tasks and transactions.
- Data Warehouses: Companies can also use their data warehouses to store archived or historical reference data for analytical reporting. Data marts are likely to maintain commercially active versions of reference data to support customer-facing activities and applications.
What are the benefits of RDM?
Much like other data domains, the value of reference data is entirely dependent on the knowledge, skill, and understanding of the people who know how to use it.
A mature approach to RDM can bring many benefits to organizations that invest in it, including:
- Automated reference data governance
- Reduced overheads by eliminating reliance on spreadsheets, data shared by emails, or SQL query complexities
- A governance environment that increases the quality and availability of data
- A single source of reference data for designated users and systems
- Increased trust in data and improved decision making as a result
- Improved reporting for tracking policy and regulation compliance
- Increased productivity through accurate and accessible reference data
- Synergy with Master Data Management (MDM) to produce a 360 view of any business domain.
And much more.
Start managing reference data centrally
Reference data is, in essence, the “What, Where, When, Who, and How” of data and provides much-needed context and relevance to data that is critical for business operations.
If you’re looking for more information about technology that can help you implement RDM, or if you’re generally interested in it as part of your data career, take a closer look at our modular platform Ataccama One. It’s the only solution that natively integrates RDM, MDM, Data Quality, and a Data Catalog. Get started in no time in our cloud!