Archive by Author

Building Solid Foundations in Big Data & Analytics

23 Aug

Originally Published on the DATUM, LLC Site: Building Solid Foundations in a data Swamp


Much has been written about Big Data, Data Science and Artificial Intelligence and how these will change the world through the insights being derived from the data. This especially applies to the unstructured data. A recent article in the Harvard Business Review indicated that “cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all.”[1]

There are a few challenges, however:

  1. How do users create understanding and ensure they have the correct data for their needs if it has no structure?
  2. How do you create a single logical view of data in a big data world, where things are not only highly variable, but also often widely dispersed?
  3. How do you address analytical requirements, where the notion of data quality, and how it is managed, varies significantly?
  4. How do you expose the data lake(s) to users in a form that is discoverable, understandable and useable?

This blog is the first in a series to explore the data management and governance perspectives related to these four challenges.

Challenge #1: Unstructured Data

The question of how to deal with unstructured data consistently raises its head as a challenge for organizations. First let’s get a few things out there:

  • There is no such thing as truly unstructured data. There is always a structure of some sort.
  • Knowing what you have and having the right tools are foundational capabilities.
  • The degree of structure required for data to be useful is variable and context driven.

Let’s take these in order:

Creating Structure

Structure is created in one of two ways:

  • Through reorganizing data so that it has structure
  • Through labeling data

The former is what happens to data in a traditional data environment as it is moved through the ecosystem – from Source to Enterprise Data Warehouse for example. The latter is what happens in a big data environment. The data is never moved, but rather labels are added to it to provide the ability to analyze that data.

Note: Data can be labelled incrementally. Newly acquired data may at first be labelled only with the acquisition date, the source, and the file type. As data moves through the data lifecycle, it will be “curated” to add additional context.
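As a minimal sketch of incremental labelling (the class name, field names, file path and label values are all hypothetical, chosen only for illustration), the data stays where it is and only the set of labels grows:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LabelledItem:
    """A piece of data that stays in place; only its labels grow."""
    location: str                      # where the data lives (never changes)
    labels: dict = field(default_factory=dict)

    def add_labels(self, **new_labels):
        # Curation adds context without moving or rewriting the data.
        self.labels.update(new_labels)
        return self


# Hypothetical item: on acquisition we only know date, source and file type.
item = LabelledItem("lake/raw/msg-0001.eml")
item.add_labels(acquired=date(2018, 8, 23), source="mail-archive", file_type="email")

# Later in the lifecycle, curation adds more context.
item.add_labels(topic="joint venture", sentiment="neutral")
```

The point of the sketch is only that the first pass is cheap, and later passes enrich the same record rather than replace it.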

A little labelling goes a long way!

How much the data needs to be labelled to be useful can be viewed on a continuum. At one end simply knowing that you are looking at emails provides enough information to know how to organize them; while at the other end, social media sentiment analysis will require extensive labelling. Regardless, the right tools are required to provide logical structure to the unstructured data.

When it comes to tools that cater to unstructured data, one key capability is entity tagging or entity extraction: recognizing an entity and tagging it with a label that makes sense to the organization – essentially, tagging it with the approved glossary term. Entities can be:

  • Anything from a simple named list such as a “product”; or
  • Extremely complex, mapping into semantic ontologies: for example, a “JV” is a “Joint Venture”, which is a type of “Company”, which is an “organization” that has “owners”.

Complementing the tagging capability is a flexible indexing capability. Tools like Elasticsearch allow users to search based on the structures discovered in the data: for example, the fact that a “Joint Venture” is a type of company. Additionally, these tools can create an index to allow discovery of similarities in text.
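To make the tagging and indexing idea concrete, here is a small sketch in plain Python. It does not use any real tool's API; the glossary, the ontology and the function names are assumptions made up for illustration. Text is tagged with approved glossary terms, and each tag is expanded with its broader terms so that a search for “Company” also finds a document that only mentions a “JV”:

```python
import re

# Hypothetical glossary: surface form found in text -> approved glossary term.
GLOSSARY = {
    "jv": "Joint Venture",
    "joint venture": "Joint Venture",
    "llc": "Limited Liability Company",
    "partnership": "Partnership",
}

# Hypothetical ontology: term -> broader term.
BROADER = {
    "Joint Venture": "Company",
    "Limited Liability Company": "Company",
    "Partnership": "Company",
    "Company": "Organization",
}


def tag_entities(text):
    """Return the approved glossary terms whose surface forms occur in the text."""
    tags = set()
    for surface, term in GLOSSARY.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text, flags=re.IGNORECASE):
            tags.add(term)
    return tags


def ancestors(term):
    """Walk up the ontology and collect every broader term."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def index_terms(text):
    """Tags plus their broader terms: what a search index would store."""
    tags = tag_entities(text)
    expanded = set(tags)
    for tag in tags:
        expanded.update(ancestors(tag))
    return expanded


terms = index_terms("The JV with Acme was signed in March.")
```

A production system would use a proper entity extraction tool and a search index; the sketch only shows why a shared glossary and ontology make the resulting tags searchable at several levels of generality.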

The key point is that once data is organized, users and applications can begin to apply big data techniques to expose insights:

  • How do emails cluster on a timeline?
  • Are organizations mentioned in the text? (Could be Joint Ventures, Partnerships, LLCs, PLCs, and so on.)
  • Is there a change in frequency over time? Related to what entity types / categories?

What does this mean from a data management perspective?

From a data management perspective unstructured data will require some new capabilities. However, in some respects, it really is more of the same: What data do I have and where is it?  Is my data labelled to communicate understanding?  Is my data easy to acquire and apply in my context?

If you think of tags or labels as descriptive metadata, and the list of tags and labels as reference metadata, then you can place this activity into the traditional data management context. In order for data to be discovered, understood and integrated across systems and use cases, organizations need to:

  • Have a disciplined approach to how data is described and labelled. This starts with creating a set of glossary terms that can be linked to define meaning. [2]
  • Implement the governance framework that ensures the data is aligned to – and remains aligned to – the business understanding of what the data is, and how it is used.
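As an illustration of linking glossary terms, the sketch below uses the simple link types mentioned in note [2] (“same as”, “subset of”) and groups terms that are declared synonyms. The terms and links are hypothetical sample data, and the grouping uses a basic union-find, which is just one possible way to do it:

```python
# Hypothetical links between glossary terms: (term, link type, term).
LINKS = [
    ("acetaminophen", "same as", "paracetamol"),
    ("Tylenol", "same as", "acetaminophen"),
    ("paracetamol", "subset of", "analgesic"),
]


def synonym_groups(links):
    """Group terms connected by 'same as' links (simple union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for left, link_type, right in links:
        if link_type == "same as":
            parent[find(left)] = find(right)
        else:
            # Other link types relate terms but do not make them synonyms.
            find(left)
            find(right)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())


groups = synonym_groups(LINKS)
```

Here the three product names end up in one group, while “analgesic” stays separate because it is a broader category rather than a synonym.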

Organizations often do not face this challenge until they need to manage data across the various operational silos, geographic regions or functional domains. The need to understand product lifecycle data alongside regional focus group data is an example of a cross functional/geography/silo data mash up that delivers highly impactful insights.

Be sure to check back in as we address the next three challenges!

References

[1] Leandro DalleMule and Thomas H. Davenport, “What’s Your Data Strategy?”, Harvard Business Review, May-June 2017 issue. https://hbr.org/2017/05/whats-your-data-strategy

[2] With reference to linking of data, the simple link types are “subset of”, “superset of”, “same as”. (See SKOS for a deeper discussion on knowledge organization). For example, using this approach one can tag pharmaceutical products to identify synonyms as recognized by the ISO standards; and synonyms of the same product that are commercial names. This is the challenge faced by organizations implementing the IDMP standards.

[3] For a good case study of data integration across disparate data sets using SKOS metadata see Healthcare Research Information

Another Data Mart?

12 Jul

Martin’s Insights published the article below. It raises the question: what to do? Clearly a CDP is created to solve an unmet need. Whatever the answer is for any given organization, data must be known “in context” and must be traceable back to its original form to survive scrutiny. Here is the article.

======================================

Recently you may have heard – from your business network or circle of marketing friends – that Customer Data Platforms (CDPs) is the new ‘black’. Can a CDP really be an all-rounded solution to marketing’s most pressing problem, when it comes to enhancing customer experience? Certainly, if you are in the BI field, the concept…

via Trend Alert – Customer Data Platforms — Martin’s Insights

Health Data Analytics 2016 — Martin’s Insights

29 Nov

I had the privilege and pleasure to attend HISA’s Health Data Analytics conference in Brisbane on 11 and 12 October 2016. What follows is this particular BI and Analytics consultant’s impressions and insights from the conference in terms of the main themes covered and the messages and impressions I take away, again from my particular…

via Health Data Analytics 2016 — Martin’s Insights

Business Framework for Analytics Implementation

3 Aug

In my previous post I discussed some analytical phrases that are gaining traction. Related to that I have had a number of requests for the deck that I presented at the Enterprise Dataversity  – Data Strategy & Analytics Forum.  I have attached the presentation here.

Also, while I am posting useful things that people keep asking for, here is a set of links that Jeff Gentry put together on management frameworks for a Dataversity Webinar. Of particular interest to me was the mapping of the Hoshin Strategic Planning Framework to the CMMI Data Management Maturity Framework. The last link is the actual Excel spreadsheet template.

Links to the webinar:

  1. Slides: http://www.dataversity.net/cdo-slides-cdo-interview-with-jeff-gentry-favorite-frameworks/
  2. Webinar Recording: http://www.dataversity.net/cdo-webinar-cdo-interview-with-jeff-gentry-favorite-frameworks/
  3. Link to Using Hoshin Frameworks: http://www.slideshare.net/Lightconsulting/hoshin-planning-presentation-7336617
  4. Hoshin Framework linked to DMM: http://content.dataversity.net/rs/656-WMW-918/images/Data Analytics Strategy and Roadmap Template 20160204D.xlsx

Forensic Analytics and the search for “robust” solutions

12 Jan

Happy New Year!

This entry has been sitting in my “to publish” file for some time. There is much more to be said on the topic. However, in the interest of getting it out … enjoy!

=======================================================

This entry was prompted by the INFORMS ANALYTICS Magazine article titled Forensic Analytics: Adapting to a Growing Pandemic by Priti Ravi, who is a senior manager with Mu Sigma and specializes “in providing analytics-driven advisory services to some of the largest retail, pharmaceutical and technology clients spread across the United States.”

Ms. Ravi writes a good article that left me hanging. Her conclusion was that the industry lacks access to sophisticated and intelligent monitoring equipment, and there exists a need for a “robust fraud management systems” that “offer a collective set of techniques” to implement a “complex adaptive approach.” I could not agree more. However, where are these systems? Perhaps even what are these systems?

Adaptive Approaches

To the last question first. What is a Complex Adaptive Approach? If you Google the phrase, the initial entries involve biology and ecosystems. However, Wikipedia’s definition encompasses medicine, business and economics (amongst others) as areas of applicability. From an analytics perspective, I define complex adaptive challenges as those that are impacted by the execution of the analytics – by doing the analysis, the observed behaviors change. This is inherently true of fraud, as the moment perpetrators understand (or believe) they can be detected, behavior will change. However, it also applies to a host of other types of challenges: criminal activity, regulatory compliance enforcement, national security; as well as things like consumer marketing and financial investment.

In an article titled Images & Video: Really Big Data, the authors (Fritz Venter, the director of technology at AYATA, and Andrew Stein, the chief adviser at the Pervasive Strategy Group) define an approach they call “prescriptive analytics” that is ideally suited to adaptive challenges. They define prescriptive analytics as follows:

“Prescriptive analytics leverages the emergence of big data and computational and scientific advances in the fields of statistics, mathematics, operations research, business rules and machine learning. Prescriptive analytics is essentially this chain of transformations whereby structured and unstructured big data is processed through intermediate representations to create a set of prescriptions (suggested future actions). These actions are essentially changes (over a future time frame) to variables that influence metrics of interest to an enterprise, government or another institution.”

My less wordy definition: adaptive approaches deliver a broad set of analytical capabilities that enable a diverse set of integrated techniques to be applied recursively.

What Does the Robust Solution Look Like?

Defining adaptive analytics this way, one can identify characteristics of the ideal “robust” solution as follows:

  • A solution that builds out a framework that supports the broad array of techniques required.
  • A solution that is able to deal with the challenges of recursive processing. This is very data and systems intensive. Essentially, for every observation evaluated, the system must determine whether or not the observation changes any PRIOR observation or assertion.
  • A solution that engages users and subject matter experts to effectively integrate business rules. In an environment where traditional predictive analytic models have a short shelf life (See Note 1), engaging with the user community is often the mechanism to quickly capture environmental changes. For example, in the banking world, tracking call center activity will often identify changes in fraud behavior faster than a neural network set of models. Engaging the user in the analytical process will require user interfaces and data visualization approaches that are targeted at the user population and integrate with the organization’s work processes. Visualization will engage non-technical users to help them apply their experience and intuition to the data to expose insights. The Census Bureau has an interesting page, and if you look at Google Images, you can get an idea of visualization approaches.
  • A solution that provides native support for statistical and mathematical functions supporting activities associated with data mining: clustering, correlation, pattern discovery, outlier detection, etc.
  • A solution that structures unstructured data: categorize, cluster, summarize, tag/extract. Of particular importance here is the ability to structure text or other unstructured data into taxonomies or ontologies related to the domain in question.
  • A solution that persists data with the rich set of metadata required to support complex analytics. While it is clearer why unstructured data must be organized into a taxonomy / ontology, this also applies to structured data. Organizing data consistently across the variety of sources allows non obvious relationships to be exposed, and application of more complex analytical approaches.
  • A solution that is relatively data agnostic  – data will come from many places and exist in many forms. The solution must manage the diversity and provide a flexible way to integrate new data into the analytical framework.
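The recursive processing requirement in the list above can be sketched with a toy entity-resolution example. This is only an illustration under simplifying assumptions: each observation is a set of attribute values, two observations are treated as the same entity if they share any value, and all names and values are hypothetical. The key behavior is that a new observation can change a prior assertion (here, that two earlier records were different entities):

```python
def ingest(groups, observation):
    """Add one observation and merge any prior groups it connects.

    groups: list of sets of attribute values, one set per resolved entity.
    Returns (new_groups, prior_assertion_changed).
    """
    attrs = set(observation)
    touching = [g for g in groups if g & attrs]
    untouched = [g for g in groups if not (g & attrs)]

    merged = set(attrs)
    for group in touching:
        merged |= group

    # If the observation links two or more prior groups, an earlier
    # assertion ("these are different entities") has just been revised.
    changed = len(touching) > 1
    return untouched + [merged], changed


groups = []
groups, c1 = ingest(groups, {"phone:555-0100", "name:A. Smith"})
groups, c2 = ingest(groups, {"email:al@example.com", "name:Alan S."})
# The third observation shares a value with each of the first two.
groups, c3 = ingest(groups, {"phone:555-0100", "email:al@example.com"})
```

After the third observation there is a single entity where there used to be two, which is why every new observation has to be checked against everything already asserted.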

What are Candidate Tools?

And now to the second question: where are these tools? It is hard to find tools that claim to be “adaptive analytic” tools, or “prescriptive analytics” tools or systems, in the sense that I have described them above. I find it interesting that over the last five years, major vendors have subsumed complex analytical capabilities into more easily understandable components. Specifically, you used to be able to find Microsoft Analysis Services easily on their site. Now it is part of MS SQL Server as SSAS, much the same way that the reporting service is now part of the database offer as SSRS (Reporting Services). There was a time a few years ago when you had to look really hard on the MS site to find Analysis Services. Of course since then Microsoft has integrated various BI acquisitions into the offer and squared away their marketing communication. Now their positioning is squarely around BI and the database. Both of these concepts are easier to sell at the executive level than the notion of prescriptive or adaptive analytics.

The emergence of databases and appliances optimized around analytics has simplified the message on the data side. Everyone knows they need a database, and now they have one for analytics. At the decision maker level, that is a much easier decision than trying to figure out what kind of analytical approach the organization is going to adopt. Vendors like Teradata have always supported analytics through the integration of SAS, and now R, as in-database functionality. Greenplum, Netezza and others have also incorporated SAS and the open source analytical language “R”. In addition, we have seen the emergence (not new, but much more talked about it seems) of the columnar database. The one I hear about most is the Sybase IQ product, although there have been a number of posts on the topic here, here, and here.

My point here is that vendors have too hard a time selling complex analytical solutions, and have subsumed the complex capabilities into the concepts that are easier to package, position and communicate around; namely, database products and Business Intelligence products. The following are product sets that are candidates for the integrated approach. We start with the big players first and work towards those that are less obvious candidates.

SAS

The SAS Fraud Framework provides an integration of all the SAS components required to implement a comprehensive analytics solution around adaptive challenges (all kinds of fraud, compliance, money laundering, etc. as examples). This is a comprehensive suite of capabilities that spans all activities: data capture, ingest, and quality; analytics tools (including algorithm libraries); data visualization; and reporting / BI capabilities. Keep in mind that SAS is a company that sells the building blocks, and the Fraud Framework is just that, a framework within which customers can build out capabilities. This is not a simple plug and play implementation process. It takes time and investment and the right team within the organization. The training has improved, and it is now possible to get comprehensive training.

As with any implementation of SAS, this one comes with all the caveats associated with comprehensive enterprise systems that integrate analytics into the fabric of an organization. The Gartner 2013 BI report indicates that SAS is “very difficult to implement”. This theme echoes across the product set. Having said that, when it comes to integrated analytics of the kind we have been discussing, all of the major vendors suffer from the same implementation challenges – although perhaps for different reasons.

The bottom line, however, is that SAS is a company grounded in analytics – the Fraud Framework has everything needed to build out a first class system. However, the corporate culture builds products for hard core quants, and this is reflected in the Gartner comments.

IBM

IBM is another company that has the complete offer. They have invested heavily in the analytics space, and between their ETL tools; the database / appliance and Big Data capabilities; the statistical product set that builds off SPSS; and the Cognos BI suite, users can build out the capabilities required. Although these products are being integrated into a seamless set of capabilities, they remain somewhat separate, and this probably explains some of the reported implementation challenges. Also, the product side of the IBM operation does not necessarily speak with the Global Services side of the house.

I had thought when IBM purchased Systems Research & Development (SRD) in 2005 that they were going to build out capabilities that SRD and Jeff Jonas had developed. Jeff heads up the Entity Analytics group within IBM Research, and his blog is well worth the read. However, the above product set appears to have remained separated from the approaches and intellectual knowledge that came with SRD. This may be on purpose – from a marketing perspective, buy the product set, and then buy IBM services to operationalize the system is not a bad approach.

Regardless, the saying “no one ever got fired for buying IBM” probably still holds true. However, as with SAS, beware of the implementation! Any one of the above products (SPSS, Cognos, and InfoSphere) requires attention when implementing. When integrating them as an operational whole, project leadership needs to ensure that expectations as to the complexity and time frame are communicated.

Other Products

There are many other product sets and I look forward to learning more about them. Once I post this, someone is going to come back and mention “R” and other open source products. There are plenty out there. However, be aware that while the products may be robust, many are not delivered as an integrated package.

With respect to open source tools, it is worth noting that the capabilities inherent in Hadoop, and the related products, lend themselves to adaptive analytics in the sense that operators can consistently re-link and re-index on the fly without having to deal with where and how the data is persisted. This is key in areas like signals intelligence, unstructured data analysis, and even structured data analysis where the notion of semantic equivalence is shifting. This is a juicy topic all by itself and worthy of a whole blog entry.

Notes:

  1. Predictive analytics relies on past observations to predict future observations. In an adaptive environment, the inputs to those predictive models continually change as a result of the outputs using the past observations.
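Note 1 can be illustrated with a deliberately tiny example. The threshold rule and the transaction amounts are made-up numbers, not real fraud data; the sketch only shows how a rule that worked on past observations stops working once behavior adapts to it:

```python
def detection_rate(amounts, threshold):
    """Share of transactions flagged by a simple 'amount above threshold' rule."""
    flagged = [a for a in amounts if a > threshold]
    return len(flagged) / len(amounts)


THRESHOLD = 1000  # hypothetical rule derived from past observations

# Hypothetical fraudulent transactions before the rule is known.
before = [1500, 2200, 1800, 900]

# Once perpetrators learn (or guess) the threshold, they stay just under it.
after = [min(amount, THRESHOLD - 1) for amount in before]

rate_before = detection_rate(before, THRESHOLD)
rate_after = detection_rate(after, THRESHOLD)
```

With these numbers the rule flags three of the four transactions before adaptation and none afterwards, even though the underlying activity has not gone away.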

The merging of analytics and transactional data platforms requires more than just an upgrade in technology!

15 Sep

This IDC white paper puts the evolution of data platforms into layman’s terms. My take away is that the unshackling of information architects and applications from the constraints of the traditional RDBMS will continue. Many of the design choices that the article details are grounded in the historic limitations of the data platform. The comments made under the Future Outlook segment are key:

“Trying to make definitive statements about the state of analytic-transaction data platforms going forward is challenging, because both the database kernel technology and the hardware on which it runs are evolving at a rapid pace. In addition to this, new workloads and mounting performance requirements add even more to the pace of development. It is safe to say that all the technology described in this study, admittedly in a very abstract manner, may be described as transitional technology that is evolving quickly. New approaches to data structures, new optimizations for transactional data once it is fully freed from the constraints of disk optimization, new ways of organizing processors and memory, and the introduction of non-volatile dual in-line memory modules (NVDIMMs) all will no doubt result in technologies within 10 years that are very different from what is described here.”

While platforms and technologies are evolving (this discussion has additional detail here), I find the juxtaposition of the “ideal” view presented here and the reality of most data operations interesting. This article provides “Essential Guidance” focused on IT buyers and guidance on choosing the right technology platform.

The focus on hardware and technology tends to obscure an equally important part of the buying equation – namely can managers manage these new technologies to achieve the desired business impacts and resulting business benefits. For the most part the answer is a resounding – NO. For these “next gen” implementations to work, organizations need to not only upgrade their platforms, but also their management practices. The balance of this blog entry examines some of the areas that the IDC article focuses on from the management perspective of the Chief Data Officer or Enterprise Information Architect.

The Enterprise Data Warehouse. Traditionally the Enterprise Data Warehouse (EDW) has been considered the repository of the “single version of the truth”. However, when it comes to analytics – and melding the transactional data store with analytics, this is a hard concept. There is no one version of the truth – everything is context driven. The design alternatives presented in the article (See Figure below) enable this in that they generally store both the transactional (source) and the fully resolved EDW version. This allows users to hit both the transactional store AND the EDW depending on the context they seek and how they want to interact with the data. Implicit in this view is that the context is captured and in a machine exploitable form that enables users to derive their own “single version of the truth”. This is a function of metadata discussed below. Additionally the article recognizes that the “one large database” solution is not generally a viable alternative; the issue being one of “manageability and agility.” This is somewhat contradicted in the opening “opinion” section in that they talk about a canonical data model. However, I am going to assume that the canonical recommendation is related to the metadata and not the content.

In all of the platform options discussed in the paper (see below), data managers need to keep track of transactional data and data within a fully resolved EDW. The context and the semantic meaning of the content of both of those data sources need to be managed, cross walked, and communicated to the user community. This will involve an evolution in both management practices and tools.

IDC Graphic on Data Platforms

Metadata. I like the way this paper addresses metadata:

“Metadata, including all data models and schemas in the relevant databases or data collections, must be harmonized, kept current with those databases, and mapped to higher order constructs, including a business glossary and, for data managed in common, a canonical data model, in order to facilitate the access and management of the data.”

The notion of mapping “higher order constructs” is key. While it is not always possible or feasible to create a canonical data model, it is very feasible to create a canonical metadata model (metamodel). This gives you a consistent way to fully describe your data regardless of the physical form it takes, and to link it to the higher order constructs referred to. My article here talks to the role the enterprise plays in managing the metadata at the enterprise level.
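A canonical metadata model can be very small. The sketch below is one hypothetical shape for it (the class, the store names, the physical names and the glossary terms are all invented for illustration): every data element, wherever it physically lives, is described the same way and linked to a business glossary term, which is enough to build a crosswalk between the transactional store, the EDW and the data lake.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementDescription:
    store: str          # e.g. transactional store, EDW, data lake
    physical_name: str  # column, field, or file path as it exists in that store
    glossary_term: str  # the higher order construct the element maps to


def crosswalk(descriptions):
    """Group physical elements by the glossary term they are linked to."""
    by_term = defaultdict(list)
    for d in descriptions:
        by_term[d.glossary_term].append((d.store, d.physical_name))
    return dict(by_term)


# Hypothetical descriptions of the same concept in three physical forms.
descriptions = [
    ElementDescription("orders_db", "cust_id", "Customer Identifier"),
    ElementDescription("edw", "DIM_CUSTOMER.CUSTOMER_KEY", "Customer Identifier"),
    ElementDescription("data_lake", "events/*.json:userId", "Customer Identifier"),
    ElementDescription("edw", "FACT_SALES.NET_AMT", "Net Sales Amount"),
]

view = crosswalk(descriptions)
```

Nothing about the physical data has to be harmonized for this to work; only the descriptions share a common form.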

Managing the Evolution. The architectures discussed in the paper all require an evolution from the transactional data stores that exist today towards platforms that can respond to business needs rapidly, and with little or no latency. The “Type 5” platform in Figure 1 is the “Data Lake” that has become such a buzzword. In this configuration, there is a single data structure for both transactions and analytics. The ETL functions, number of indexes, and flexibility that can be applied to render the data all place a larger burden on the governance disciplines. Additionally, the process by which the organization integrates the business and IT activities requires formalizing in a way that breaks down the traditional silos.

Hampering the evolution at some level is the fact that the tool suites are not entirely intuitive. Tools to handle the mapping of the higher order constructs (concept systems, ontologies, taxonomies, reference data…), and the management of multiple dictionaries, cannot easily be implemented without complex configuration and often coding. The tool vendors seem to be coming along, but many are still working to apply governance and curation within the context of table based systems. The reality is that creating fully described data that is linked to higher order constructs, and managing these relationships, requires a collection of tools that must be configured to address your environment. It is not yet easy.

The Way Forward. Previously I have made the comment that the Information Architect, Enterprise Data Management Office, or CDO must initially focus on creating a tangible value proposition for the business side of the house. As long as data management is perceived as a function related to standards, governance and “protocol” it will be perceived as slowing down the business and getting in the way of achieving business goals. This article details a scoped down set of goals that lay the foundation for that initial value proposition. Once the enterprise data management function is able to make the case they actually improve business operations, and impact key success metrics (i.e. revenue), what next?

This is where all the articles regarding CDOs seem to agree. The next step is all about outreach and engagement with the broader business community – potentially internal and external to the organization. My recommendation here is to perform this activity using a framework that ensures the discussions stay focused on goals and practices, and result in actionable, measurable and prioritized recommendations. The CMMI Data Management Maturity Model (DMM) is one such framework. I am admittedly biased, as I helped create it, but for an independent opinion Bob Lambert at CapTech wrote a review that speaks volumes. The framework is used to engage in a series of workshops. These workshops serve to identify a maturity level, but more importantly identify the business priorities and concerns as detailed by the workshop participants. This is critical as the resulting recommendations inherently have buy-in from across the organization.

Because the Data Management Maturity Model evaluates capabilities at the “practice” level (i.e. what people actually do), it inherently details the next steps in terms of recommendations; in other words – do not try to create a semantically equivalent data model across the whole organization if you cannot even do it for a business unit or a project! Additionally, the model recognizes the relationships between functions. The end result is a holistic and integrated set of guidance for the overall data management strategy and implementation roadmap.

Organizations seeking to upgrade their data platforms to more closely resemble the “Analytic Transactional data platform” that enables the real-time enterprise as discussed in the IDC white paper will have greater success more quickly if they evolve their data management practices at the same time.
