Originally Published on the DATUM, LLC Site: Building Solid Foundations in a Data Swamp
Much has been written about Big Data, Data Science, and Artificial Intelligence, and how they will change the world through the insights derived from data. This applies especially to unstructured data. A recent article in the Harvard Business Review indicated that “cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all.”[1]
There are a few challenges however:
- How do users create understanding and ensure they have the correct data for their needs if it has no structure?
- How do you create a single logical view of data in a big data world, where things are not only highly variable but also often widely dispersed?
- How do you address analytical requirements where the notion of data quality, and how it is managed, varies significantly?
- How do you expose the data lake(s) to users in a form that is discoverable, understandable and usable?
This blog is the first in a series to explore the data management and governance perspectives related to these four challenges.
Challenge #1: Unstructured Data
The question of how to deal with unstructured data consistently comes up as a challenge for organizations. First, let’s get a few things out there:
- There is no such thing as truly unstructured data. There is always a structure of some sort.
- Knowing what you have and having the right tools are foundational capabilities.
- The degree of structure required for data to be useful is variable and context driven.
Let’s take these in order:
Creating Structure
Structure is created in one of two ways:
- Through reorganizing data so that it has structure
- Through labelling data
The former is what happens to data in a traditional data environment as it is moved through the ecosystem, from source system to Enterprise Data Warehouse, for example. The latter is what happens in a big data environment: the data is never moved; rather, labels are added to it to provide the ability to analyze that data.
Note: Data can be labelled incrementally. Newly acquired data may be labelled with only the acquisition date, the source, and the file type. As data moves through the data lifecycle, it is “curated” to add additional context.
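To make the note concrete, here is a minimal sketch of incremental labelling in Python. The field names and the `curate` helper are illustrative assumptions, not any particular product’s API:

```python
from datetime import date

# Minimal label record attached to newly acquired data (illustrative field names).
# The raw file itself is never moved; only this metadata grows over time.
label = {
    "acquired_on": date.today().isoformat(),
    "source": "crm-export",
    "file_type": "mbox",
}

def curate(label, **new_tags):
    """Add context incrementally as the data moves through its lifecycle."""
    return {**label, **new_tags}

# A later lifecycle stage enriches the same record rather than restructuring the data.
label = curate(label, business_domain="sales", contains_pii=True)
print(label)
```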
A little labelling goes a long way!
How much the data needs to be labelled to be useful can be viewed on a continuum. At one end, simply knowing that you are looking at emails provides enough information to organize them; at the other end, social media sentiment analysis requires extensive labelling. Regardless, the right tools are required to provide logical structure to the unstructured data.
When it comes to tools that cater to unstructured data, one key capability is entity tagging or entity extraction: tools that can recognize an entity and tag it with a label that makes sense to the organization, essentially the approved glossary term (a minimal sketch follows the list below). Entities can be:
- Simple, such as a named list of “products”; or
- Complex, mapping entities into semantic ontologies: a “JV” is a “Joint Venture”, which is a type of “Company”, which is an “Organization” that has “owners”.
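As a rough illustration of what entity tagging can look like, the sketch below matches surface forms against an approved glossary and walks a small ontology. The glossary, the ontology, and the exact-match approach are all illustrative assumptions; real extraction tools use trained models rather than string matching:

```python
# Illustrative glossary: surface forms mapped to approved glossary terms.
GLOSSARY = {
    "JV": "Joint Venture",
    "joint venture": "Joint Venture",
    "LLC": "Limited Liability Company",
}

# Illustrative ontology: each term's broader type, so "JV" resolves up to "Organization".
IS_A = {
    "Joint Venture": "Company",
    "Limited Liability Company": "Company",
    "Company": "Organization",
}

def tag_entities(text):
    """Return (surface form, glossary term, type chain) for each match in the text."""
    tags = []
    for surface, term in GLOSSARY.items():
        if surface.lower() in text.lower():
            chain, node = [term], term
            while node in IS_A:  # walk up the ontology
                node = IS_A[node]
                chain.append(node)
            tags.append((surface, term, chain))
    return tags

print(tag_entities("The JV was registered as an LLC in 2016."))
```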
Complementing the tagging capability is a flexible indexing capability. Tools like Elasticsearch allow users to search based on the structures discovered in the data: a search for “Company”, for example, can also return documents that mention only a “Joint Venture”, because a “Joint Venture” is a type of company. Additionally, these tools can create an index that allows discovery of similarities in text.
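A hedged sketch of how tagged entities might be indexed and searched with the official Python client follows; the index name, field names, and local cluster URL are assumptions for illustration:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Illustrative connection; point this at your own cluster.
es = Elasticsearch("http://localhost:9200")

# Index a document together with the entity tags discovered in it.
es.index(index="documents", document={
    "text": "The JV was registered as an LLC in 2016.",
    "entities": ["Joint Venture", "Limited Liability Company"],
    "entity_types": ["Company", "Organization"],
})

# Search on the tagged structure: find every document mentioning any Company,
# regardless of whether the raw text said "JV", "LLC", or something else.
hits = es.search(index="documents", query={"match": {"entity_types": "Company"}})
print(hits["hits"]["total"])
```

The keyword-argument style here follows the 8.x Python client; older clients take a single `body` parameter instead.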
The key point is that once data is organized, users and applications can begin to apply big data techniques to expose insights (one such roll-up is sketched after this list):
- How do emails cluster on a timeline?
- Are organizations mentioned in the text? (Could be Joint Ventures, Partnerships, LLCs, PLCs, and so on.)
- Is there a change in frequency over time? Related to what entity types / categories?
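The last question needs very little machinery once the labels exist. Below is a minimal sketch that counts entity-type mentions per month; the input records are invented for illustration:

```python
from collections import Counter

# Illustrative tagged emails: (month sent, entity types found by the tagger).
tagged_emails = [
    ("2017-01", ["Company"]),
    ("2017-01", ["Company", "Organization"]),
    ("2017-02", ["Organization"]),
    ("2017-02", ["Company"]),
]

# Mentions of each entity type per month: a shift between months is the
# "change in frequency over time" signal the question asks about.
per_month = Counter(
    (month, etype) for month, etypes in tagged_emails for etype in etypes
)
for (month, etype), count in sorted(per_month.items()):
    print(month, etype, count)
```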
What does this mean from a data management perspective?
From a data management perspective, unstructured data will require some new capabilities. However, in some respects it really is more of the same: What data do I have and where is it? Is my data labelled to communicate understanding? Is my data easy to acquire and apply in my context?
If you think of tags or labels as descriptive metadata, and the list of tags and labels as reference metadata, then you can place this activity into the traditional data management context. In order for data to be discovered, understood and integrated across systems and use cases, organizations need to:
- Have a disciplined approach to how data is described and labelled. This starts with creating a set of glossary terms that can be linked to define meaning (see the sketch after this list). [2]
- Implement a governance framework that ensures the data is aligned, and remains aligned, to the business understanding of what the data is and how it is used.
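Here is a minimal sketch of what linked glossary terms can look like as reference metadata, using the simple link types from note [2]. The term names and link keys are illustrative assumptions, loosely modelled on SKOS:

```python
# Illustrative reference metadata: glossary terms linked with simple link types.
GLOSSARY_LINKS = {
    "JV": {"same_as": "Joint Venture"},
    "Joint Venture": {"subset_of": "Company"},
    "Company": {"subset_of": "Organization"},
}

def resolve(term):
    """Follow "same as" links to the preferred term, then collect broader terms."""
    while "same_as" in GLOSSARY_LINKS.get(term, {}):
        term = GLOSSARY_LINKS[term]["same_as"]
    broader, node = [], term
    while "subset_of" in GLOSSARY_LINKS.get(node, {}):
        node = GLOSSARY_LINKS[node]["subset_of"]
        broader.append(node)
    return term, broader

print(resolve("JV"))  # -> ('Joint Venture', ['Company', 'Organization'])
```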
Organizations often do not face this challenge until they need to manage data across operational silos, geographic regions or functional domains. The need to combine product lifecycle data with regional focus group data is an example of a cross-functional/geography/silo data mash-up that delivers high-impact insights.
Be sure to check back in as we address the next three challenges!
References
[1] Leandro DalleMule and Thomas H. Davenport, “What’s Your Data Strategy?”, Harvard Business Review, May–June 2017. https://hbr.org/2017/05/whats-your-data-strategy
[2] With reference to the linking of data, the simple link types are “subset of”, “superset of”, and “same as” (see SKOS for a deeper discussion of knowledge organization). For example, using this approach one can tag pharmaceutical products to identify synonyms as recognized by the ISO standards, and synonyms of the same product that are commercial names. This is the challenge faced by organizations implementing the IDMP standards.
[3] For a good case study of data integration across disparate data sets using SKOS metadata, see Healthcare Research Information.