We need to think about ETL differently!

26 Jan

This blog was started to write about analytics – so here I go again on ETL! Seems that if you are working on Big Data things, it always starts with the data, and in many respects that is the thing that is most difficult – or perhaps requires the most wrenching changes – See this Creating an Enterprise Data Strategy for some interesting facts on data strategies.

ETL is a chore at the best of times. Analysts are generally in a rush to get data into a format that supports the analytical task of the day. Often this means taking the data from the data source and performing the data integration required to make the data analytically ready. This is often done at the expense of any effort by the data management folks to apply controls oriented at data quality issues.

This has created a tension between the data management side of the house and the analytical group. The data management folks are focused on getting data into an enterprise Warehouse or DataMart in a consistent format with data defined and structured in accordance with definitions and linkages defined through the data governance process. Analysts on the other hand – especially those engaged in adaptive type of analytical challenges – seem always to be looking at data through a different lens. Analysts often want to apply different entity resolution rules; want to understand new linkages (implies new schema); and, generally seek to apply a much looser structure to the data in order to expose insights that are often hidden by the enterprise ETL process.

This mismatch in requirements can be addressed in many ways. However, a key starting step is to redefine the meaning of ETL within an organization. I like the definition attributed to Michael Porter where he defines a “Lifecycle of Transformation” that shows how data is managed from the raw or source state through to application in a business context (Larger Image)

Value Chain of Transformation

Value Chain of Transformation

I am pretty sure that Michael Porter does not think of himself as an ETL person, and the article  (Page 14) I obtained this from indicates that this perspective is not ETL. However, I submit that the perspective that ETL stops once you have data in the Warehouse or the DataMart is just too limiting, and creates a false divide. Data must be both useable and actionable – not just useable. By looking at the ETL challenge across the entire transformation (does that make ETL TL TL TL …?), practitioners are more likely to meet the needs of business users.

Related discussions for future entries:

  • Wayne Eckerson has a number of articles on this topic. My favorite: Exploiting Big Data Strategies for Integrating with Hadoop by Wayne Eckerson; Published: June 1, 2012.
  • The limitations placed on analytics through the application of a schema independent of the analytical context is one of the drawbacks of “old school” RDBMS. The ability of a file based Hadoop / mapreduce oriented analytical environment to apply the schema later in the process is a key benefit of Hadoop/Mapreduce.
Advertisements

2 Responses to “We need to think about ETL differently!”

  1. benwright99 January 30, 2013 at 10:06 pm #

    I found this webcast on production operations with Hadoop to be informative.
    http://go.techtarget.com/r/20476732/15231288/1

Trackbacks/Pingbacks

  1. Primer on Big Data, Hadoop and “In-memory” Data Clouds | analyticaltern - August 25, 2013

    […] This is a good companion piece to the articles by Wayne Eckerson referenced in this post […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: