The overall challenge is one of sorting out the metadata management within the organization. Presumably this is creating data quality problems of some sort.
It is somewhat hard to evaluate the metadata question in the absence of a few other data points, as metadata is everywhere and is influenced by numerous factors: overall data management policies, the information architecture, the system architecture, and the security policies, to name a few. Additionally, there is often a misalignment between these factors and how the organization’s management structure supports data management best practices.
First step – define the data quality issue. This breaks out along three dimensions: 1. how users find and retrieve data assets; 2. how data is delivered; and 3. how data is managed.
Breaking data down along these dimensions will expose the function that is creating the duplication and the kind of metadata involved.
Additionally, viewing the problem this way will address how the organization is thinking about the process. How is the library function addressed? Is there a cataloging function that keeps an inventory of all data assets available to the organization? Is there a role of data scientist? Perhaps data “packager” is a better term for most organizations. Is there an organizational element to why metadata is getting fragmented? An example of this would be that the Librarian function is organizationally part of the Research Division, while data collection and packaging is an IT function. The Library software that manages the search and cataloging functions is managed separately from the metadata management system supporting the data warehouse. In that arrangement, anything done by Research creates fragmentation, and possibly duplication, of metadata.
Once this perspective is applied, one gets a better idea of the impact of the data quality problem. A cumbersome search interface can be addressed in the short term through better training or increased resources on the help desk. However, security- or audit-related shortcomings might result in a security breach, which would have a much greater impact.
So what to do? Clearly this is not just a metadata problem. In many organizations, the challenge of duplicated metadata is created by legacy environments that have grown fragmented over time. The roadmap that defines the path forward will be unique for each organization. However, a few best practices are called for.
- Creation of an information architecture / framework that all can agree on.
- Creation of an agreed-on set of “states” that can be applied to data assets that are flowing through this information architecture.
- Creation of a policy/process/standards matrix that defines how these are applied to data assets based on where they are in the information architecture.
- The above implies that there is a data governance component within the organization’s management structure – if this does not exist, it needs to be established.
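To make the idea of states and a policy/process/standards matrix a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the zone names, the three states, and the rule text are hypothetical, and each organization would agree on its own through its data governance function.

```python
from enum import Enum


class State(Enum):
    """Hypothetical lifecycle states for a data asset."""
    RAW = "raw"
    VALIDATED = "validated"
    PUBLISHED = "published"


# Hypothetical matrix: (zone in the information architecture, state)
# maps to the policy, process and standard that apply there.
POLICY_MATRIX = {
    ("ingest", State.RAW): {
        "policy": "restricted access",
        "process": "profile and validate",
        "standard": "source schema recorded",
    },
    ("warehouse", State.VALIDATED): {
        "policy": "role-based access",
        "process": "package for delivery",
        "standard": "registry-approved metadata only",
    },
    ("library", State.PUBLISHED): {
        "policy": "open to all staff",
        "process": "catalog and index",
        "standard": "catalog record complete",
    },
}


def rules_for(zone, state):
    """Return the governance rules for an asset in a given zone and state."""
    try:
        return POLICY_MATRIX[(zone, state)]
    except KeyError:
        # A missing cell is a governance gap, not something to guess at.
        raise LookupError(
            f"no rule defined for {state.value} assets in {zone}"
        ) from None
```

The point of the lookup failing loudly is that an empty cell in the matrix is itself a finding: it shows where assets are flowing through the architecture without an agreed policy.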
From a platform perspective there are two best practice considerations: a metadata registry should be evaluated, and metamodels should be created through which all metadata are viewed. The registry is most important, as it imposes a discipline on the metadata management process.
Think of a metadata registry as the “reference data” or the “controlled vocabulary” for metadata. For any given data asset, it defines what metadata one should expect to see. If it is not in the metadata registry, it should not exist. All of the IT tools that are engaged in moving, enhancing or creating derived works from data assets should use the metadata registry as the reference source for all activity related to metadata. This reduces the likelihood that metadata will need to be duplicated, and if it is duplicated, it reduces the likelihood that it is duplicated incorrectly.
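As a rough sketch of what "if it is not in the metadata registry, it should not exist" could look like in a tool, the Python below checks an asset's metadata against a registry. The registry contents, the asset types, and the function name `check_metadata` are all hypothetical; a real registry would be a managed service rather than a dictionary in code.

```python
# Hypothetical registry: for each asset type, the metadata fields
# one should expect to see. Field names are illustrative only.
REGISTRY = {
    "dataset": {"title", "owner", "created", "classification"},
    "report": {"title", "owner", "source_dataset"},
}


def check_metadata(asset_type, metadata):
    """Compare one asset's metadata with the registry.

    Returns two sorted lists: fields that are present but not
    registered, and registered fields that are missing.
    """
    if asset_type not in REGISTRY:
        raise KeyError(f"asset type not in registry: {asset_type}")
    expected = REGISTRY[asset_type]
    present = set(metadata)
    unregistered = sorted(present - expected)
    missing = sorted(expected - present)
    return unregistered, missing


unregistered, missing = check_metadata(
    "dataset",
    {"title": "Q3 sales", "owner": "Research", "author": "jdoe"},
)
print(unregistered)  # ['author']
print(missing)       # ['classification', 'created']
```

Here "author" is flagged because it duplicates the role of the registered "owner" field under a different name, which is the kind of drift a shared registry is meant to catch.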
Metadata registries are covered by the ISO/IEC 11179 standard. My sense is that the development of a metadata registry is something that comes with Big Data. Just like reference data and master data, metadata is a management challenge that is supported by IT capabilities, NOT the other way around.