Archive | Data Management RSS feed for this section

The merging of analytics and transactional data platforms requires more than just an upgrade in technology!

15 Sep

This IDC white paper puts the evolution of data platforms into layman’s terms. My take away is that the unshackling of information architects and applications from the constraints of the traditional RDBMS will continue. Many of the design choices that the article details are grounded in the historic limitations of the data platform. The comments made under the Future Outlook segment are key:

“Trying to make definitive statements about the state of analytic-transaction data platforms going forward is challenging, because both the database kernel technology and the hardware on which it runs are evolving at a rapid pace. In addition to this, new workloads and mounting performance requirements add even more to the pace of development. It is safe to say that all the technology described in this study, admittedly in a very abstract manner, may be described as transitional technology that is evolving quickly. New approaches to data structures, new optimizations for transactional data once it is fully freed from the constraints of disk optimization, new ways of organizing processors and memory, and the introduction of non-volatile dual in-line memory modules (NVDIMMs) all will no doubt result in technologies within 10 years that are very different from what is described here.

While platforms and technologies are evolving (this discussion has additional detail here), I find the juxtaposition of the “ideal” view presented here and the reality of most data operations interesting. This article provides “Essential Guidance” focused on IT buyers and guidance on choosing the right technology platform.

The focus on hardware and technology tends to obscure an equally important part of the buying equation – namely can managers manage these new technologies to achieve the desired business impacts and resulting business benefits. For the most part the answer is a resounding – NO. For these “next gen” implementations to work, organizations need to not only upgrade their platforms, but also their management practices. The balance of this blog entry examines some of the areas that the IDC article focuses on from the management perspective of the Chief Data Officer or Enterprise Information Architect.

The Enterprise Data Warehouse. Traditionally the Enterprise Data Warehouse (EDW) has been considered the repository of the “single version of the truth”. However, when it comes to analytics – and melding the transactional data store with analytics, this is a hard concept. There is no one version of the truth – everything is context driven. The design alternatives presented in the article (See Figure below) enable this in that they generally store both the transactional (source) and the fully resolved EDW version. This allows users to hit both the transactional store AND the EDW depending on the context they seek and how they want to interact with the data. Implicit in this view is that the context is captured and in a machine exploitable form that enables users to derive their own “single version of the truth”. This is a function of metadata discussed below. Additionally the article recognizes that the “one large database” solution is not generally a viable alternative; the issue being one of “manageability and agility.” This is somewhat contradicted in the opening “opinion” section in that they talk about a canonical data model. However, I am going to assume that the canonical recommendation is related to the metadata and not the content.

In all of the platform options discussed in the paper (see below), data managers need to keep track of a transactional data and data within a fully resolved EDW. The context and the semantic meaning of the content of both of those data sources needs to be managed, cross walked, and communicated to the user community. This will involve an evolution in both management practices and tools.

IDC Graphic on Data PLatforms

Metadata. I like the way this paper addresses metadata:

“Metadata, including all data models and schemas in the relevant databases or data collections, must be harmonized, kept current with those databases, and mapped to higher order constructs, including a business glossary and, for data managed in common, a canonical data model, in order to facilitate the access and management of the data.”

The notion of mapping “higher order constructs” is key. While it is not always possible or feasible to create a canonical data model, it is very feasible to create a canonical metadata model (metamodel). This give you a consistent way to fully describe your data regardless of the physical form it takes, and link it to higher order constructs referred to. My article here talks to the role the enterprise plays in managing the metadata at the enterprise level.

Managing the Evolution. The architectures discussed in the paper all require an evolution from the transactional data stores that exist today towards platforms that can respond to business needs rapidly, and with little or no latency. The “Type 5” platform in Figure 1 is the “Data Lake” that has become such a buzzword. In this configuration, there is a single data structure for both transactions and analytics. The ETL functions, number of indexes, and flexibility that can be applied to render the data all place a larger burden on the governance disciplines. Additionally, the process by which the organization integrates the business and IT activities requires formalizing in a way that breaks down the traditional silos.

Hampering the evolution at some level is the fact that the tool suites are not entirely intuitive. Tools to handle the mapping of the higher order constructs (concepts systems; ontologies; taxonomies, reference data…), and the management of multiple dictionaries cannot easily be implemented without complex configuration and often coding. The tool vendors seem to be coming along, but many are still working to apply governance and curation within the context of table based systems. The reality is that to create fully described data that is linked to higher order constructs, and to manage these relationships requires a collection of tools that must be configured to address your environment. It is not yet easy.

The Way Forward. Previously I have made the comment that the Information Architect, Enterprise Data Management Office, or CDO must initially focus on creating a tangible value proposition for the business side of the house. As long as data management is perceived as a function related to standards, governance and “protocol” it will be perceived as slowing down the business and getting in the way of achieving business goals. This article details a scoped down set of goals that lay the foundation for that initial value proposition. Once the enterprise data management function is able to make the case they actually improve business operations, and impact key success metrics (i.e. revenue), what next?

This is where all the articles regarding CDO’s seem to agree. The next step is all about outreach and engagement with the broader business community – potentially internal and external to the organization. My recommendation here is to perform this activity using a framework that ensures the discussions stay focused on goals, practices, and result in actionable, measurable and prioritized recommendations. The CMMI Data Management Maturity Model (DMM) is one such framework. I am biased, admittedly as I helped create it, but for an independent opinion Bob Lambert at CapTech wrote a review that speaks volumes. The framework is used to engage in a series of workshops. These workshops serve to identify a maturity level, but more importantly identify the business priorities and concerns as detailed by the workshop participants. This is critical as the resulting recommendations inherently have buy-in from across the organization.

Because the Data Management Model evaluates capabilities at the “practice” level (i.e. what people actually do), it inherently details the next steps in terms of recommendations; in other words – do not try to create a semantically equivalent data model across the whole organization if you cannot even do it for a business unit or a project! Additionally, the model recognizes the relationships between functions. The end result is a holistic and integrated set of guidance for the overall data management strategy and implementation roadmap.

Organizations seeking to upgrade their data platforms to more closely resemble the “Analytic Transactional data platform” that enables the real-time enterprise as discussed in the IDC white paper will have greater success more quickly if they evolve their data management practices at the same time.

Evaluating Different Persistence Methods as part of the Planning Process

4 Jul

Every once in a while, I get asked about how to select between different types of databases. Generally, this comment is as a result of a product vendor or consultant making a recommendation to evolve towards a Big Data solution. The issue is twofold in that companies seek to understand what the next generation data platform looks like; AND, how or if their current environment can evolve. This involves understanding the pros and cons of the current product set and to what degree they can exist with newer approaches – Hadoop being the current platform people talk about.

The following is a list of data persistence approaches that helps at least define the options. This was done some time ago, so I am sure the vendors shown have evolved. However, think of it as a starting point to frame the discussion.

In general, one wants to anchor these discussions in some defined criteria that can help frame the discussion within the context of  business drivers. In the following figure, the goal is to show that as data sources and consumers of your data expand to include increasingly complex data structures and “contexts,” there is a need to evolve approaches beyond the traditional relational database (RDBMS) approaches. Different organizations will have different criteria. I provide this as a rubric that has worked before – you will need to create an approach that works for your organization or client.

Evolution of Data Persistence

A number of data persistence approaches  support the functional components as defined. These are described below.

Defined Pros / Cons Vendor examples
Relational Databases

(Row Orientation)

Traditional normalized data models optimized for efficiently storing data Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks, and resource challenges resulting in delays for data management teams. This approach is challenged when dealing with complex semantic data where multiple levels of parent / child relationships exist.

Advantages: This approach is best for transactional data where the relationships between the data and the use cases driving how data is accessed and used are stable. In uses where relational integrity is important and must be enforced in a consistent manner, this approach can work well. In a row based approach, contention on record locking are easier to manage than other methods.

Disadvantages: As the relationships between data and relational integrity are enforced through the application of a rigid data model, this approach is inflexible, and changes can be hard to  implement.

All major database vendors: IBM – DB2; Oracle; MS SQL and others
Columnar Databases

(Column Oriented)

Data organized  or indexed around columns; can be implemented in SQL or a NoSQL environments. Advantages: Columnar data designs lend themselves to analytical tasking involving large data sets where rapid search, retrieval and aggregation type queries are performed on large data tables. A columnar approach inherently creates vertical partitioning across the datasets stored this way. It is efficient and scalable.

Disadvantages: efficiencies can be offset by  the need to join many queries to obtain the desired result.

•Sybase IQ

•InfoBright

•Vertica (HP)

•Par Accel

•MS SQL 2012

Defined Pros / Cons Vendor examples
RDF Triple Stores / Databases Data stored organized around RDF triples (Actor-action-object OR Subject-predicate-Object); can be implemented in SQL or a NoSQL environments. Advantages: A semantic organization of data lends itself to analytical and knowledge management tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies or SKOS (1) type relationships are required to organize entities and their relationships to one another: corporate hierarchies/networks; insider trading analysis for example. This approach to organizing data is often represented in the context of the “semantic web” whose organizing constructs are RDF and OWL. when dealing with complex semantic data where multiple levels of parent / child relationships exist, this approach is more efficient that RDBMS

Disadvantages: This approach to storing data is often not as efficient as relational approaches. It can be complicated to write queries to traverse complex networks – however, this is often not much easier in relational databases either.

Note: these can be implemented with XML  formatting or in some other form.

Native XML  / RDF Databases

•Marklogic (COTS)

•OpenLink Virtuoso (COTS)

•Stardog (o/s, COTS)

•BaseX  (o/s)

•eXist  (o/s)

•Sedna (o/s)

XML Enabled Databases

•IBM DB2

•MS SQL

•Oracle

•PostgrSQL

XML enabled databases deal with XML as a CLOB in a table or organized into tables based on a schema

Graph Databases A database that uses graph structures to store data. See XML / RDF Stores / Databases. Graph Databases are a variant on this theme.

Advantages:  Used primarily to store information on networks. Optimized for iterative joins; often in a recursive process (2)..

Disadvantages: Storage challenges – these are large datasets; builds through iterative joins – very processor intensive.

•ArangoDB

•OrientDB

•Cayley

•Aurelius Titan

•Aurelius Faunus

•Stardog

•Neo4J

•AllegroGraph

(1) SKOS = Simple Knowledge Organization Structure. Relationships can be expressed as triples; examples are “is part of”; “is similar to”

(2) Recursion versus iteration

Defined Pros / Cons Vendor Examples
NoSQL

File based storage – HDFS

Data structured to expose  insights through the use of “key pairs” This has many of the characteristics of the XML, Columnar and Graph approaches. In this instance, the data is loaded, and key value pair (KVP) files created external to the data. Think of the KVP as an index with a pointer back to the source data. This approach is generally associated with the Hadoop / MapReduce  capabilities, and the definition here assumes that KVP files are queried using the capabilities available in the Hadoop ecosystem

Advantages: flexibility; MPP capabilities; speed; schema-less; scalable; Great at creating views of data; and performing simple calculations across Big Data; significant open source community – especially through the Apache Foundation. Shared nothing architecture optimizes the read process. However, it creates challenges in meeting ACID (1) requirements. File based storage systems adhere to the BASE (2) requirements

Disadvantages: Share nothing architecture creates complexity in uses where sequencing of transactions or writing data is important – especially when multiple nodes are involved; complex metadata requirement; few tool “packages” available to support production environments; relatively immature product set.

Document Store

•Mongo DB

•Couch DB

Column Store

•Cassandra

•Hbase

•Accumulo

Key Value Pair

•Redis

•Riak

(1) ACID = Atomicity; Consistent; Isolated; Durable. Used for Transaction processing systems.

(2) BASE = Basic Availability, Soft State; Eventual Consistency. Used for distributed parallel processing systems where maintaining complete consistency is often prohibitively expensive

Defined Pros / Cons Vendor examples
In-Memory Approaches Data approaches where the data is loaded into active memory to improve efficiency Note that multiple persistence approaches can be implemented in memory

Advantages: Speed; flexibility – ability to virtualize views and calculated / derived tables; think of Datamarts in the traditional BI context

Disadvantages: Hardware, cost

•SAP HANA

•SAS High Performance Analytics

•VoltDB

The classes of tools below are presented as they provide alternatives for capabilities that are likely to be required. Many of the capabilities are resident in some of the tool sets already discussed.
Data Virtualization The ability to produce tables or views without going through an ETL process Data  virtualization is a capability built into other products. Any In- Memory product inherently virtualizes data. Likewise a number of the Enterprise BI tools allow data – generally in the form of “cubes” to be virtualized. Denodo Technologies is the major pure play vendor. The others vendors generally provide products that are part of larger suites of tools. •Composite Software (Cisco)

•Denodo Technologies

•Informatica

•IBM

•MS

•SAP

•Oracle

Search Engines Data management components that are used to search structured and unstructured data Search engines and appliances perform functions as simple as indexing data, and as complex as Natural Language Processing (NLP) and entity extraction. They are referenced here as the functionality can be implemented as stand alone capability and may be considered as part of the overall capability stack. •Google Search Appliance

•Elastic Search

Defined Pros / Cons Vendor examples
Hybrid Approaches Data products that implement both SQL and NoSQL approaches These are traditional SQL database approaches that have been partnered with one or more of the approaches defined above. Teradata acquired Aster to create a “bolt on” to a traditional SQL Db; IBM has Db2/Netezza/Big Insights. SAS uses a file based storage system and has created “Access Modules” that work though Apache HIVE to apply analytics within either an HDFS environment, or the SAS environment.

Another hybrid approach is exemplified by Cassandra that incorporates elements of a data model within a HDFS based system.

One also sees organizations implementing HDFS / RDBMS solutions for different functions. For example acquiring, landing and staging data using an HDFS approach, and then once requirements and the business use is known creating structured data models to facilitate and control delivery

Advantages: Integrated solutions; ability to leverage legacy; more developed toolkits to support production operations. Compared to open source, production ready solutions require less configuration and code development.

Disadvantages: Tend to be costly; architecture tends to be inflexible – all or nothing mindset.

•Teradata

•EMC

•SAS

•IBM

•Cassandra (Apache)

Formalizing and optimizing your risk architecture within the BCBS context

1 Jul

This article was interesting to me for two reasons: 1) It formalized a data view of risk management within the financial community around the context of BCBS (and for BCBS 239 related to Risk Data Aggregation and Reporting); and, 2) it provided an interesting perspective on governance / data quality and associated metrics.

This paragraph sums it up:

“What is actually happening in practice is that each major institution’s regional banks are lobbying/negotiating with their local/regional regulators to agree on an initial form of compliance – typically as some form of MS Word or PowerPoint presentation to prove that that they have an understanding of their risk data architecture. One organization might write up 100 page tome to show its understanding – and another might write up 10. The “ask” is vague and the interpretation is subjective. Just what is adequate?”

The perspective on governance that  the article proposed is a way to “systematically compare different architectures” with a set of metrics that was understandable, obtainable, and actionable.

Have a read – let me know your thoughts. The article also provides a nice summary of the BCBS requirements.

Old School vs. new school – Its Both!

24 Oct

Excellent article by Wayne Eckerson (most of his are) . We give Data Warehouses a bad name because they have been implemented in a way that does not meet the businesses needs – certainly not from an analytical perspective. HOWEVER, the business reasons that they exist remain, and this is Wayne’s point. I have been watching the shouting match between Inmon and Kimball.  I think they are both wrong – the answer is not as simple as they make it out to be – our world will be hybrid SQL/RDBMS and NoSQL and everything will need to play nice together! Those are my words  of wisdom on a Friday 🙂

Gartner Magic Quadrant for Operational Database Management Systems is out

11 Nov

http://www.gartner.com/technology/reprints.do?id=1-1M9YEHW&ct=131028&st=sb

I had a conversation with some one the other day, and we agreed that there was no “front end” for hadoop / NO-SQL type data environments. This seems to be a big issue in terms of these systems taking front and center from an operational perspective. More to follow on this.

Analytics keeps moving closer to the data!

18 Oct

http://feedproxy.google.com/~r/dbms2/feed/~3/QOuK0EQFRzs/

Note the list of partners – all have a background in visualization and analyst driven capabilities – not big data munging. Where does this leave the companies that are neither visualization, nor database companies? Companies like SAS.

%d bloggers like this: