
Data Prep – More than a Buzzword?

25 Feb

“Data Prep” has become a popular phrase over the last year or so – why? At a practical level, data preparation tools are providing the same functionality that traditional ETL (extract, transform, load) tools provide. Are data prep tools just a marketing gimmick to get organizations to buy more ETL software? This blog seeks to address why data prep capabilities have become a topic of conversation within the data and analytics communities.

Traditionally, data prep has been viewed as slow and laborious, often associated with linear, rigid methodologies. Recently, however, data prep has become synonymous with data agility. It is a set of capabilities that pushes the boundaries of who has access to data, and how they can apply it to business challenges. Looked at this way, data prep is a foundational capability for digital transformation, which I define as the ability of companies to evolve in an agile fashion in some key dimension of their business model. The business driver of most transformation programs is to fundamentally change key business performance metrics, such as revenue, margins, or market share. Viewed in this way, data prep tools are a critical addition to the toolbox when it comes to driving key business metrics.

Consider the way that data usage has evolved, and the role that data prep capabilities are playing.

Analytics is maturing. Analytics is not a new idea. However, for years it was a function relegated to Operations Research (OR) folks and statisticians. This is no longer the case. As BI and reporting tools grew more powerful and increasingly enabled self-service for end users, those users began asking questions that were more analytical in nature.

Data-driven decisions require data “in context.” Decision-making and the process that supports it require data to be evaluated in the context of the business or operational challenge at hand. How management perceives an issue will drive what data is collected and how it is analyzed. In the 1950s and 1960s, operations research drove analytics, and the key performance indicators were well established. These included time in process, mean time to failure, yield, and throughput. All of these were well understood and largely prescriptive. Fast forward to now. Analytics is applied well beyond the scope of operations research. New types of analysis, driven in large part by social media trends, are much less prescriptive, and their value is driven by context. Examples include key opinion leaders, fraud networks, perceptual mapping, and sentiment analysis.

Big data is driving the adoption of machine learning. Machine learning requires the integration of domain expertise with the data in order to expose “features” within the data that enhance the effectiveness of machine learning algorithms. The activity that identifies and organizes these features is called “feature engineering.” Many data scientists would not equate “data preparation” with feature engineering, yet there is a strong parallel with what an analyst does. A business analyst invariably creates features as they prepare their data for analysis: 1) observations are placed on a timeline; 2) revenue is totaled by quarter and year; 3) customers are organized by location, by cumulative spend, and so on. Data prep in this context is the organization of data around domain expertise, and is a critical input to the harnessing of big data through automation.
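To make the analyst's three steps concrete, here is a minimal sketch in Python. The transaction rows, field order, and function names are all hypothetical and exist only for illustration; a real data prep tool would do this interactively rather than in code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (customer, location, order date, revenue).
transactions = [
    ("acme", "East", date(2018, 5, 2), 250.0),
    ("bolt", "West", date(2018, 2, 20), 75.0),
    ("acme", "East", date(2018, 1, 15), 100.0),
    ("bolt", "West", date(2018, 11, 9), 125.0),
]


def quarter_of(d):
    """Return a (year, quarter) key for a date."""
    return (d.year, (d.month - 1) // 3 + 1)


def build_features(rows):
    """Derive analyst-style features from raw transaction rows."""
    # 1) Place observations on a timeline.
    timeline = sorted(rows, key=lambda r: r[2])

    revenue_by_quarter = defaultdict(float)
    revenue_by_year = defaultdict(float)
    spend_by_customer = defaultdict(float)
    customers_by_location = defaultdict(set)

    for customer, location, when, revenue in timeline:
        # 2) Total revenue by quarter and by year.
        revenue_by_quarter[quarter_of(when)] += revenue
        revenue_by_year[when.year] += revenue
        # 3) Organize customers by cumulative spend and by location.
        spend_by_customer[customer] += revenue
        customers_by_location[location].add(customer)

    return {
        "timeline": timeline,
        "revenue_by_quarter": dict(revenue_by_quarter),
        "revenue_by_year": dict(revenue_by_year),
        "spend_by_customer": dict(spend_by_customer),
        "customers_by_location": dict(customers_by_location),
    }


features = build_features(transactions)
```

Each derived total or grouping is a "feature" in the machine learning sense: it encodes domain knowledge (quarters matter, location matters) that the raw rows do not state explicitly.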

Data science is evolving and data engineering is now a thing. Data engineering focuses on how to apply and scale the insights from data science into an operational context. It’s one thing for a data scientist to spend time organizing data for modest initiatives or limited analysis, but for scaled up operational activities involving business analysts, marketers and operational staff, data prep must be a capability that is available to staff with a more generalized skill set. Data engineering supports building capabilities that enable users to access, prepare and apply data in their day-to-day lives.

“Data Prep” in the context of the above is enabling a broader community of data citizens to discover, access, organize and integrate data into these diverse scenarios. This broad access to data using tools that organize and visualize is a critical success factor for organizations seeking the business benefits of digitally enabling their organization. Future blogs will drill down on each of the above to explore how practitioners can evolve their data prep capabilities and apply them to business challenges.

The topic of protecting personal information will grow in importance in 2019

19 Nov
For those interested in the protection of personal information, the IAPP has published an interesting, albeit rather hefty, IAPP-EY Annual Privacy Governance Report 2018, and the NTIA has released industry comments on pending privacy regulation. I noted that the IAPP report indicates most solutions are still mostly or entirely manual. I do not see how this avoids becoming a management nightmare as organizations evolve their data maturity to align operations and marketing more closely. Data management as a process discipline and some degree of automation are going to be critical capabilities to ensure personal information is protected. There are simply too many opportunities for error when this is done manually.
I recently published an article in TDAN on automating data management and governance through machine learning. It is not just about ML; other capabilities will be required. However, relying on manual processes alone opens up risk and places the burden on management to enforce policies that are often resisted because they are perceived as a burden on actually doing business. Data management as a process discipline, in conjunction with automated processes, will reduce operational overhead and risk.

DGIQ 2018

12 Jul

The DGIQ conference this year went well. I gave two presentations and caught up with industry colleagues and customers. It helped that it was in San Diego – and the weather, relative to the hot mugginess of the Mid-Atlantic, was excellent.

My presentation on GDPR was surprisingly well attended. I say surprising in that the deadline has passed, yet I find that there are still companies formulating their plans. However, I am beginning to feel a bit like Samuel Jackson.


In the GDPR presentation, the goal was to focus attention on not only doing the right thing to be compliant, but also doing it right. How do we reduce the stress and overhead of dealing with regulators? We call this “Audit Resilience.” I spoke to a number of people who are taking a wait-and-see approach to GDPR compliance. Interestingly, even though they are taking this approach, they are still getting requests to remove personal information. It seems to me that if you are taking a wait-and-see approach, you still need to be able to remove personal information from at least the web site; otherwise, you risk triggering a complaint, and then … you have no defense. The goal has to be to do everything possible not to trigger a complaint. The presentation took about 15 minutes, and the rest of the time was spent demonstrating the data control model in the DATUM governance platform – Information Value Management.

I also had the pleasure of presenting with Lynn Scott, who co-chairs the Healthcare Technology & Innovation practice at Polsinelli with Bill Tanenbaum. What we wanted to do was push home the point that collaboration is key when dealing with thorny risk and compliance issues. We tried to have some fun with this one.

I will be at the Data Architecture Summit in Chicago in October. The session will cover:

  • What are the requirements to ensure management is “audit resilient”?
  • What is a Control System and how is it related to a Data Control Model?
  • What is “regulatory alignment” from a data perspective?
  • How do I build a Data Control Model?
  • What role do advanced techniques (AI, Machine Learning) play in audit resilience?

Hope to see you all there.


Will the US evolve towards a GDPR “like” approach to personal information?

3 Jul


In a conversation with a lawyer a few months ago, the comment was made that the US has already implemented GDPR, just in small bits in each state: collectively similar to GDPR, but with no one jurisdiction anything like GDPR. Except now we have California implementing the California Consumer Privacy Act, which will go into effect in January 2020. This regulation is similar to GDPR in spirit and in many details. What is fascinating is how the bill was enacted. This article explains how California politics works, and points out that the rapid adoption of the legislation is actually an attempt to create a more flexible environment for companies to negotiate the various compromises that I am sure will come. It is also worth noting that companies that are well on the way towards GDPR compliance will essentially already be compliant with the California law. I do not see this being the last state to create or update its privacy laws. This was a trend that was already underway. However, California is a big state and the home of many tech companies, and the State’s new law will surely have an influence on how other States address the privacy issue.

Update 1: Comments on non EU countries updating laws – Canada

Update 2: IAPP Comment on Californian law: 

Enterprise Data World

22 May

I attended the Enterprise Data World conference last month in San Diego. I was speaking on GDPR, and on what you need to do if you are just starting to think about GDPR now that the deadline is so close. The meeting was well attended, which was a surprise given how close we are to the deadline. The Facebook / Cambridge Analytica fiasco has drawn attention to the protection of personal information, and to GDPR in particular. What I see are the smaller companies getting drawn into the discussion, and realizing how big this might be for them. The deck is below.

In general, the show continues to improve. The keynote presentation by Mike Ferguson of Intelligent Business Strategies Ltd was interesting in that, had the same presentation been given a couple of years ago, I am not sure it would have been as well received. It would have been considered a fantasy by many in the audience. Some of his key points:

  • Very comprehensive at the enterprise level – remember when enterprise data management, or enterprise anything, was a bad word?!
  • Tagging and classification is all going to be algorithm driven, and done in the pipeline. In his presentation IoT was driving the volume, and he had some good volume numbers.
  • Pushing the virtual enterprise data lake – everything tied together in a metadata hub.

The products and vendor knowledge were the biggest surprise of the show – probably because expectations were low. In general, the tools discussions were more applied. Key observations:

  • Much more evolved presentations – hooked to business drivers.
  • Integrated products on the rise. Especially around the source to target discussion:
    • ETL, DQ, Profiling and Remediation are integrated into a single pipeline discussion
    • Sales people were more knowledgeable about how this works.
    • API injection of new capabilities into this pipeline – this was something that all professed to do. However, when pushed, it was clear that there were varying stages of capability. All seemed to have APIs; the question was how robust each API is.
    • Linked data / semantics was a bigger topic than normal. It is beginning to be discussed in an applied sense.
    • The FIBO (Financial Industry Business Ontology) is a driver in this. More importantly, it is being integrated into tools so people can visualize how it is applied. This is pulling in the business side of the house.
    • This is all metadata, especially business metadata, and it is shifting the discussion towards business.

Audit Resilience and the GDPR

15 May

Compliance activities for organizations are often driven from the legal or risk groups. The initial focus is on management’s position and actions required to be compliant; generally this starts with the creation of policies. This makes sense as policies are a reflection of management’s intent and provide guidance on how to put strategic thinking into action. The legal teams provide legal interpretation and direction with respect to risk. This is also incorporated into the policies. So, what happens next as your organization addresses challenges around ensuring effective implementation and subsequent operational oversight of policies required for General Data Protection Regulation (GDPR) compliance?


The challenges associated with GDPR as well as other compliance activities are centered on achieving “Audit Resilience.” We define this as the ability to address the needs of the Auditor – internal or external – in such a way that compliance is operationally enabled and can be validated easily and with minimal disruptions and cost. The goal is to reduce the stress, the chaos and the costs that often accompany these events to a manageable level.


Audit Resilience means that the auditor can:

  • Easily discern the clear line of sight between Policies => Standards => Controls => Actors => Data.
  • Review and explicitly align governance artifacts (policies, standards and processes) to compliance requirements.
  • Access and validate the “controls” that ensure standards are applied effectively.
  • Find evidence of execution of the governance practices within the data.



GDPR compliance is a function of creating logical linkage and consistency across multiple functions and actors – down to the data level.  Details will vary based on the organization and the assessment of risk.

Overall, the following are critical to successfully demonstrating compliance:

  1. Produce a catalog of all impacted data
  2. Know where data is being used, and by whom
  3. Show governance lineage from Policy => Process => Standard => Control => Data
  4. Report on effectiveness of “Controls”
  5. Produce specific data related to particular requirements such as: Security Events, Notification, Privacy Impact Assessments, and so forth.
  6. Show the relationship of governance tasks to both data and the business processes that use Personal Information.
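As a rough illustration of item 3, the governance lineage from Policy => Process => Standard => Control => Data can be modeled as a chain of links that an auditor walks from a data element back up to the policy. This is a minimal sketch only: the artifact names and the `links` structure are hypothetical, and it does not represent how any particular governance platform stores this information.

```python
# Expected order of governance levels, from management intent down to data.
LEVELS = ["Policy", "Process", "Standard", "Control", "Data"]

# Hypothetical links: each (level, name) artifact points to the artifact
# one level up that it implements.
links = {
    ("Data", "customer.email"): ("Control", "mask-email-on-export"),
    ("Control", "mask-email-on-export"): ("Standard", "pii-masking-standard"),
    ("Standard", "pii-masking-standard"): ("Process", "data-export-process"),
    ("Process", "data-export-process"): ("Policy", "personal-data-policy"),
    # This control is not tied to any standard, so its lineage has a gap.
    ("Data", "customer.phone"): ("Control", "phone-retention-check"),
}


def trace_lineage(data_element, links):
    """Walk upward from a data element; return the chain ordered top-down."""
    node = ("Data", data_element)
    chain = [node]
    seen = {node}
    while node in links:
        node = links[node]
        if node in seen:  # guard against accidental cycles in the links
            break
        seen.add(node)
        chain.append(node)
    return list(reversed(chain))


def is_complete(chain):
    """True when the chain covers every level from Policy down to Data."""
    return [level for level, _ in chain] == LEVELS


email_chain = trace_lineage("customer.email", links)
phone_chain = trace_lineage("customer.phone", links)
```

Here `is_complete(email_chain)` holds while `is_complete(phone_chain)` does not, which is the kind of gap an auditor would flag: a control operating on personal data with no standard or policy behind it.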