Archive | ETL

Data Prep – More than a Buzzword?

25 Feb

“Data Prep” has become a popular phrase over the last year or so – why? At a practical level, data preparation tools provide essentially the same functionality as traditional ETL (extract, transform, load) tools. Are data prep tools just a marketing gimmick to get organizations to buy more ETL software? This blog seeks to address why data prep capabilities have become a topic of conversation within the data and analytics communities.

Traditionally, data prep has been viewed as slow and laborious, often associated with linear, rigid methodologies. Recently, however, data prep has become synonymous with data agility. It is a set of capabilities that pushes the boundaries of who has access to data and how it can be applied to business challenges. Looked at this way, data prep is a foundational capability for digital transformation, which I define as the ability of companies to evolve in an agile fashion in some key dimension of their business model. The business driver of most transformation programs is to fundamentally change key business performance metrics, such as revenue, margins, or market share. Seen through that lens, data prep tools are a critical addition to the toolbox when it comes to driving key business metrics.

Consider the way that data usage has evolved, and the role that data prep capabilities are playing.

Analytics is maturing. Analytics is not a new idea. However, for years it was a function relegated to Operations Research (OR) folks and statisticians. This is no longer the case. As BI and reporting tools grew more powerful and increasingly enabled self-service for end users, those users began asking questions that were more analytical in nature.

Data-Driven decisions require data “in context.” Decision-making and the process that supports it require data to be evaluated in the context of the business or operational challenge at hand. How management perceives an issue will drive what data is collected and how it is analyzed. In the 1950s and 1960s, operations research drove analytics, and the key performance indicators were well established: time in process, mean time to failure, yield, and throughput. All of these were well understood and largely prescriptive. Fast forward to now. Analytics is broadly applied and used well beyond the scope of operations research. New types of analysis, driven in large part by social media trends, are much less prescriptive, and their value is driven by context. Examples include key opinion leader analysis, fraud networks, perceptual mapping, and sentiment analysis.

Big data is driving the adoption of machine learning. Machine learning requires the integration of domain expertise with the data in order to expose “features” within the data that enhance the effectiveness of machine learning algorithms. The activity that identifies and organizes these features is called “feature engineering.” Many data scientists would not equate “data preparation” with feature engineering, yet it correlates strongly with what an analyst does. A business analyst invariably creates features while preparing data for analysis: 1) observations are placed on a time line; 2) revenue is totaled by quarter and year; 3) customers are organized by location, by cumulative spend, and so on. Data prep in this context is the organization of data around domain expertise, and is a critical input to the harnessing of big data through automation.
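As a minimal sketch, the three analyst habits above are all feature engineering. The record layout and field names here are illustrative (not from any particular system):

```python
from datetime import date
from collections import defaultdict

# Hypothetical order records; field names are illustrative.
orders = [
    {"date": date(2023, 1, 15), "customer": "acme", "region": "east", "revenue": 100.0},
    {"date": date(2023, 2, 20), "customer": "acme", "region": "east", "revenue": 250.0},
    {"date": date(2023, 4, 3),  "customer": "bolt", "region": "west", "revenue": 75.0},
]

# 1) Place observations on a time line: derive a quarter label from the date.
def quarter(d):
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"

# 2) Total revenue by quarter -- a classic engineered feature.
revenue_by_quarter = defaultdict(float)
for o in orders:
    revenue_by_quarter[quarter(o["date"])] += o["revenue"]

# 3) Organize customers by location and cumulative spend.
spend_by_customer = defaultdict(float)
for o in orders:
    spend_by_customer[(o["customer"], o["region"])] += o["revenue"]
```

Each derived value (quarter label, quarterly total, cumulative spend) is a feature in the machine-learning sense, even though an analyst would just call it “prepping the data.”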

Data science is evolving and data engineering is now a thing. Data engineering focuses on how to apply and scale the insights from data science in an operational context. It’s one thing for a data scientist to spend time organizing data for modest initiatives or limited analysis, but for scaled-up operational activities involving business analysts, marketers, and operational staff, data prep must be a capability available to staff with a more generalized skill set. Data engineering supports building capabilities that enable users to access, prepare, and apply data in their day-to-day work.

“Data Prep” in the context of the above is enabling a broader community of data citizens to discover, access, organize, and integrate data into these diverse scenarios. This broad access to data, using tools that organize and visualize it, is a critical success factor for organizations seeking the business benefits of digital enablement. Future blogs will drill down on each of the above to explore how practitioners can evolve their data prep capabilities and apply them to business challenges.


Magic Quadrant for Data Integration Tools

23 Feb

Gartner Data Integration Survey

October 2012 – All the usual suspects. However, I was surprised (and pleased) to see Talend in the mix. Interesting to note that SAS leads on number of installs (13k) – up there with Microsoft (12k).

We need to think about ETL differently!

26 Jan

This blog was started to write about analytics – so here I go again on ETL! It seems that if you are working on Big Data, it always starts with the data, and in many respects the data is the most difficult part – or perhaps the part that requires the most wrenching changes. See Creating an Enterprise Data Strategy for some interesting facts on data strategies.

ETL is a chore at the best of times. Analysts are generally in a rush to get data into a format that supports the analytical task of the day. Often this means taking data straight from the source and performing whatever integration is required to make it analytically ready. This is often done at the expense of any effort by the data management folks to apply controls aimed at data quality.

This has created a tension between the data management side of the house and the analytical group. The data management folks are focused on getting data into an enterprise warehouse or data mart in a consistent format, with data defined and structured in accordance with the definitions and linkages established through the data governance process. Analysts, on the other hand – especially those engaged in adaptive analytical challenges – always seem to be looking at data through a different lens. Analysts often want to apply different entity resolution rules; want to understand new linkages (which implies new schema); and generally seek to apply a much looser structure to the data in order to expose insights that are often hidden by the enterprise ETL process.
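A toy sketch of those two lenses, with hypothetical records and rules: the governed view resolves entities on an exact key, while the analyst’s looser rule groups records that the strict rule keeps apart.

```python
from collections import defaultdict

# Hypothetical customer records; ids, names, and amounts are illustrative.
records = [
    {"id": 1, "name": "Acme Corp", "spend": 100},
    {"id": 2, "name": "ACME CORP", "spend": 250},
    {"id": 3, "name": "Bolt Ltd",  "spend": 75},
]

# Governed rule: each governed key (id) is its own entity -> three entities.
governed = {r["id"]: r for r in records}

# Analyst's looser rule: normalize the name before grouping -> two entities,
# exposing a linkage (the two Acme rows) the governed view keeps separate.
loose = defaultdict(int)
for r in records:
    loose[r["name"].strip().lower()] += r["spend"]
```

Neither view is wrong; they answer different questions, which is exactly why the two groups keep talking past each other.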

This mismatch in requirements can be addressed in many ways. However, a key starting step is to redefine the meaning of ETL within an organization. I like the framing attributed to Michael Porter: a value chain of transformation that shows how data is managed from the raw or source state through to application in a business context.

Value Chain of Transformation

I am pretty sure that Michael Porter does not think of himself as an ETL person, and the article (page 14) I obtained this from indicates that this perspective is not ETL. However, I submit that the perspective that ETL stops once data is in the warehouse or data mart is just too limiting, and creates a false divide. Data must be both usable and actionable – not just usable. By looking at the ETL challenge across the entire transformation (does that make ETL TL TL TL …?), practitioners are more likely to meet the needs of business users.
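A small sketch of that chained view, with illustrative stage names and data: each stage is its own transform, and the transformation does not stop at the warehouse load – the analytic view is one more “TL” applied downstream.

```python
# Hypothetical three-stage chain; stage names and data are illustrative.
def extract(raw_rows):
    # E: pull fields out of raw delimited text, skipping blank lines
    return [r.strip().split(",") for r in raw_rows if r.strip()]

def load_warehouse(rows):
    # TL #1: the governed, typed representation that lands in the warehouse
    return [{"region": region, "revenue": float(rev)} for region, rev in rows]

def analytic_view(warehouse):
    # TL #2: a further transform *after* the warehouse, for the analyst
    totals = {}
    for row in warehouse:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals

raw = ["east,100", "west,75", "east,250"]
totals = analytic_view(load_warehouse(extract(raw)))
```

The point of the chain is that `analytic_view` is still ETL work, even though it runs after the warehouse load that traditional definitions treat as the finish line.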

Related discussions for future entries:

  • Wayne Eckerson has a number of articles on this topic. My favorite: Exploiting Big Data: Strategies for Integrating with Hadoop (June 1, 2012).
  • The limitations placed on analytics by applying a schema independent of the analytical context are one of the drawbacks of “old school” RDBMS. The ability of a file-based Hadoop/MapReduce analytical environment to apply the schema later in the process (“schema-on-read”) is a key benefit of Hadoop/MapReduce.
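A minimal schema-on-read sketch in plain Python (the event format and field names are hypothetical): raw text is stored as-is at write time, and the reader imposes fields and types only at analysis time – a different reader could impose a different schema on the same files.

```python
import json

# Raw events land as untyped text lines; no schema is imposed at write time.
raw_lines = [
    '{"user": "a", "amount": "12.5", "ts": "2012-06-01"}',
    '{"user": "b", "amount": "7", "ts": "2012-06-02"}',
]

# Schema-on-read: *this* analysis decides which fields matter and what
# types they have; another analysis could read the same lines differently.
def read_with_schema(line):
    rec = json.loads(line)
    return {"user": rec["user"], "amount": float(rec["amount"])}

events = [read_with_schema(line) for line in raw_lines]
```

In an RDBMS the `amount` column’s type is fixed at load time by the governed schema; here the reader chooses it, which is the flexibility the bullet above is pointing at.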