Archive | Big Data RSS feed for this section

Automating Data Management and Governance through Machine Learning

I fairly consistently get questions on the value of Machine Learning in data management and governance. Sometimes this question is framed at a high level in a very “buzz wordy” way. The person asking the question may not know what machine learning (ML) is. They have just heard the words so many times that they know it is good and should be part of the discussion. At other times, the person asking the question knows about ML and various other analytical techniques, but has never really thought of ML in the context of a data management tool. The challenge is that the emergence of IoT data, Customer 360 programs, and emerging best practices that focus on sharing semantically tagged data, all contribute to a fundamental need to do things differently. Machine learning is one of the tools in the toolbox to address the challenges related to scale, change velocity, and the consistent evolution of users and their use cases.

This post focuses on how we can automate the process of identifying data, classifying it, and linking it to internal and external references to provide semantic meaning. The goal of this post is simply to describe what machine learning is for the data manager, and what tasks it performs in the context of the standards based operational perspective.

From an operational perspective, the figure below presents the evolution of data from the “raw” transactional state to a highly labelled or curated state that can be shared between purchaser and vendor; or indeed any producer or consumer of data. Machine Learning plays a role in automating how data is curated and enriched across this lifecycle.

Figure 1: The Curation from raw data to sharable Information

If we drill down on the curation lifecycle, we can identify the various repositories that would be required, and a few of the key supporting standards. These standards and their roles are discussed more completely in a follow on post.

Figure 2: The Curation flow across the various repositories required. NOTE that this functional perspective is discussed in the context of a metadata hub in Enterprise Data Management – Where to Start?

The database symbols outlined in blue (solid lines) represent data at rest. The rectangular items outlined in green (dashed lines) represent tasks that automate how data is augmented as it moves along this path. The focus of this discussion is on these green boxes.

Activities within the Data Quality Rules and MDM Rules tasks can be broken down into a number of functional capabilities as detailed below. Some of these capabilities are traditional data operations tasks; namely, persisting metadata in a database, and exposing the data through some sort of cataloging and publishing capability. The other items (outlined in blue) are those where machine learning approaches can be applied.

Figure 3: Functional capabilities supported by Machine Learning

First let’s start with a definition of Machine Learning as Machine Learning has multiple definitions within the popular literature. The website Techemergence provides a comprehensive definition:

“Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.”

Machine learning techniques play a major role in automating the process detailed above especially over unknown or new data sets.

For data management practitioners it is important to understand that no one machine learning technique is going to apply. In all likelihood multiple approaches will be chained together and invariably executed recursively to ensure that the data can be identified, classified and then linked to the appropriate unique identifier. In the ideal world, the algorithms will change or learn to accommodate changes in the data being classified. The figure below lists some of the machine learning techniques that may be applied.

Machine Learning Techniques

Unstructured Data

Structured Data

· Entity Tagging / Extraction

· Categorize

· Cluster

· Summarize

· Tag

· Linking

· Associate

· Characterize

· Classify

· Predict

· Cluster

· Pattern Discovery

· Exception Analysis

Note that these invariably interact with one another. If I tag people entities within unstructured text, I may wish to characterize them using structured technique: count of male names; frequency per document; frequency across documents, etc. This speaks to the layered and recursive nature of machine learning, and the richness of the metadata that the data team will need to manage. For a more technical view of ML techniques see this summary.

These are detailed below with considerations for program managers.

Capability	Considerations
Identify	Machine Learning approaches support the identification of instance data in order to classify the data. Is this personal Information? Does it look like a financial #? Does it reside in a financial statement? For organizations where there is a significant installed legacy challenge. It will be important to have algorithms that identify data of interest. The identification of personal information is a current area of interest driven by the GDPR regulation.
Classify	Once data is identified, ML approaches support classifying the data within the data dictionary: data is in finance domain; it is in the “Deliver” phase of the Supply Chain Operations Reference (SCOR) lifecycle; etc. Classification algorithms must exist that tag the data with the appropriate classifier. Capabilities must quantify and resolve those instances where there is uncertainty as to accuracy of the classification algorithm. For example, are we are 100% certain that this is a vendor and not a customer?
Resolve	The completed data dictionary will support entity resolution by providing a richer feature set against which MDM machine learning algorithms can be run. Resolving the identity of the master data element may require a multi-tiered approach be run iteratively: apply Algorithm #1; for those that do not resolve with Algorithm #1, apply Algorithm #2; etc. For example, now that I know that have classified the data item as vendor master data (previous step), can I resolve the identity with certainty to identify which vendor it is?
Link	The resolved entity must be linked to internal and external reference sources. Machine Learning techniques may be used to identify and resolve link candidates and specify link type / strength. The analytical details of this may be addressed in the above “Resolve” capability. However, the focus here should be on identifying the correct link (or links) where there are multiple candidate reference sets where links could be established. This is a critical step as the linkage to the internal reference “Concept System” is what describes the data element from a semantic perspective. It is also what links the data being described to a publicly available set of definitions that external parties can reference (See “Sharable Information” in figure above). These linkages cross walk an industry accepted definition between supply chain partners. Example: If a supply chain manager seeks to communicate the nature of a product requirement to a vendor – a machine screw for example. The ability to specify length of screw versus length of the “shoulder” on the screw; thread size (Metric, standard, imperial?); type of head (hex, square, pan head, etc.) is critical. The internal labels for these are linked to the industry agreed on labels available to the vendor community. As long as the vendor is using the same reference concept system, both buyer and vendor can be assured that they are talking about the same machine screw.

Once these activities have been completed, the results need to be persisted in a metadata repository and published in a Data Catalog that will allow users to understand what data is available and how it can be accessed.

Some Closing Thoughts :: It’s all about the Ecosystem Maturity!

The above discussion and the content of the two posts in the works on MDM standards and data quality, identifies a set of standards and techniques that seek to streamline and automate the process of Master Data Management. However, these exist within the context of the organization’s data ecosystem. Data practitioners seeking to evolve master data management must ask some core questions regarding information architecture and data management maturity within their ecosystem:

How do these standards support my data strategy?
- Do I have a business case?
- Executive sponsorship?
- Funding?
Does my information architecture support the capabilities that I need to manage Master Data as envisioned by the standards?
- Will legacy systems impact how this gets executed?
- Does the architecture support a “Service Oriented” metadata registry or catalog concept?
- Do I have a metadata catalog?
- What are the architectural boundaries and how do I share data across those boundaries?
Do I have the data management maturity to execute?
- Identified and scalable processes?
- Processes applied consistently across business units?
- A governance operating model that can accommodate new functions and the change management overhead?
- What controls and metrics exist? Need to be created?

Understanding how standards and machine learning fit within the information architecture and the organization’s capability maturity will enable the data team to define the right strategy and build out a realistic roadmap. For organizations with an established and mature governance function, many of the above questions will be resolved – or the mechanism to resolve them exists. However, for organizations that have less capability maturity, the strategy and roadmap will need to be explicit in identifying the business units where foundational capabilities can be created that can later be adopted across the organizations as the need and maturity evolve.

For an additional perspective on the business cases that might drive ML, see Classification – the key to releasing data’s value !

Tags: architecture, best practices, Machine Learning, MDM, ML

Comments 1 Comment
Categories analytics, Big Data, Classification, Data Management, Master Data, Metadata, Standards
Author analyticaltern

Classification – the key to releasing data’s value !

6 Jun

Someone asked me the other day what the business case was for classifying data. For anyone that has engaged with data to perform analytics or produce business intelligence reports, this may seem like a silly question. However, in many minds, the data does not need to be labelled or classified in any way. The data is used by an application and if that application is performing correctly, the data must be good. And, at some level they are right – as long as the data involved never has to be used outside its application, it may never need to be classified or labelled in any way. The data receives all of it semantic context from the application where it is used.

So when does classification become important? It becomes important when data leaves the application that gave it context. For many of our customers this occurs when data leaves the transactional ERP type system, and is moved into a data warehouse or a data lake whose purpose is to provide access to data from multiple sources. Traditionally, this movement from transactional to a more generally accessible repository came with a level of curation. Prior to the concept of the “Data Lake,” data was moved into the data warehouse with the goal of making it the “single source” of truth. This often involved significant levels of data stewardship and curation to reconcile conflicting versions of “truth.” With the growing awareness and adoption of analytics, the idea of a stable concept of “truth” is elusive. The right data for an analyst is context driven and at times highly variable. The Data Lake construct addresses this issue by allowing all data to be loaded so that the user can determine what data to use based on the decision context at the time. This is what data classification enables. Well classified data can be discovered, analyzed, accessed and integrated into a user’s context based on the classification labels that have been exposed to the user in the Data Asset catalog. Based on this perspective, classification is foundational for driving value out of data in the areas of analytics, business intelligence, operational efficiencies, and compliance.

Indeed in the big data space, classification is foundational for analytics, machine learning, the application of higher level logic, and (way up the maturity curve) for building artificial intelligence capabilities. As a foundational building block for Ai, classification is an interesting topic; although for many too abstracted from today’s problems. However, as the foundation for making data discoverable, understandable, accessible and able to be integrated into downstream applications, it is highly relevant to today’s challenges – almost regardless of where your current capabilities stand. For this reason any data management shop should include in its planning a workstream that seeks to evolve classification capabilities

Consider the following uses cases:

Business Intelligence: marketers seeking to report on price sensitivity and are comparing the difference between prices quoted, prices invoiced, and prices paid net of discount. Data across all of the ERP or transactional systems in use must be classified such that the BI Team is assured that all fields marked as “Price” are the correct type of price.

Marketing Analytics: Your customer 360ᵒ program seeks to understand external factors that may have influenced pricing and discounts provided. What customers are related to the prices referenced above? What kind of customers are they (industry, buying frequency, average purchase, …)? How can I correlate those with external events (elections, new regulation, natural disasters, …)? All of this analysis is supported by data that is classified to reflect the types of queries that may occur and analytical operations to be performed.

Operational Efficiency: Your COO wants to ensure that the acquisition process is fully optimized, and seeks to benchmark operations using the SCOR (Supply Chain Operations Reference) Model. The Operations Team downloads the 250 SCOR performance metrics and seeks to map those to the relevant data. Classification supports the ability to find the right data and map it to the data specified in the SCOR Model.

Compliance & Risk Management. Risk teams will rely on well classified data to enable risk models that are robust and flexible in their ability to address evolving risk. This is especially the case for risk associated with adaptive threats; for example fraud and cyber-crime.

Bottom line, if classification is not something that you have thought about, consider putting a plan together. It is the key to releasing the value of your data, and fully leveraging data as an asset.

Tags: Classification

Comments 2 Comments
Categories Best Practices, BI, Big Data, Classification, Data Management, Metadata
Author analyticaltern

Forensic Analytics and the search for “robust” solutions

12 Jan

Happy New Year!

This entry has been sitting in my “to publish” file for some time. There is much more to be said on the topic. however, in the interest of getting it out … enjoy!

=======================================================

This entry was prompted by the article in the INFORMS ANALYTICS Magazine article titled Forensic Analytics: Adapting to a Growing Pandemic by Priti Ravi who is a senior manager with Mu Sigma and specializes “in providing analytics-driven advisory services to some of the largest retail, pharmaceutical and technology clients spread across the United States.”

Ms. Ravi writes a good article that left me hanging. Her conclusion was that the industry lacks access to sophisticated and intelligent monitoring equipment, and there exists a need for a “robust fraud management systems” that “offer a collective set of techniques” to implement a “complex adaptive approach.” I could not agree more. However, where are these systems? Perhaps even what are these systems?

Adaptive Approaches

To the last question first. What is a Complex Adaptive Approach? If you Google the phrase, the initial entries involve biology and ecosystems. However, wikipedia’s definition encompasses medicine, business and economics (amongst others) as areas of applicability. From an analytics perspective, I define complex adaptive challenges as those that are impacted by the execution of the analytics – by doing the analysis, the observed behaviors change. This is inherently true of fraud as the moment perpetrators understand (or believe) they can be detected, behavior will change. However, it also applies to a host of other type of challenges: criminal activity, regulatory compliance enforcement, national security; as well as things like consumer marketing and financial investment.

In an article titled Images & Video: Really Big Data the authors (Fritz Venter the director of technology at AYATA; and Andrew Stein the chief adviser at the Pervasive Strategy Group. define an approach they call “prescriptive analytics” that is ideally suited to adaptive challenges. They define prescriptive analytics as follows:

“Prescriptive analytics leverages the emergence of big data and computational and scientific advances in the fields of statistics, mathematics, operations research, business rules and machine learning. Prescriptive analytics is essentially this chain of transformations whereby structured and unstructured big data is processed through intermediate representations to create a set of prescriptions (suggested future actions). These actions are essentially changes (over a future time frame) to variables that influence metrics of interest to an enterprise, government or another institution.”

My less wordy definition: adaptive approaches deliver a broad set of analytical capabilities that enables a diverse set of integrated techniques to be applied recursively.

What Does the Robust Solution Look Like?

Defining adaptive analytics this way, one can identify characteristics of the ideal “robust” solution as follows:

A solution that builds out a framework that supports the broad array of techniques required.
A solution that is able to deal with the the challenges of recursive processing. This is very data and systems intensive. Essentially for every observation evaluated, the system must determine whether or not the observation changes any PRIOR observation or assertion.
A solution that engages users and subject matter experts to effectively integrate business rules. In an environment where traditional predictive analytic models have a short shelf life (See Note 1), engaging with the user community is often the mechanism to quickly capture environmental changes. For example, in the banking world, tracking call center activity will often identify changes in fraud behavior faster than a neural network set of models. Engaging the User in the analytical process will require user interfaces, and data visualization approaches that are targeted at the user population, and integrate with the organization’s work processes. Visualization will engage non technical users to help them apply their experience and intuition to the data to expose insights. The census bureau has an interesting page, and if you look at Google Images, you can get an idea of visualization approaches.
A solution that provides native support for statistical and mathematical functions supporting activities associated with data mining : clustering, correlation, pattern discovery, outlier detection, etc.
A solution that structures unstructured data: categorize, cluster, summarize, tag/extract. Of particular importance here is the ability to structure text or other unstructured data into taxonomies or ontologies related to the domain in question.
A solution that persists data with the rich set of metadata required to support complex analytics. While it is clearer why unstructured data must be organized into a taxonomy / ontology, this also applies to structured data. Organizing data consistently across the variety of sources allows non obvious relationships to be exposed, and application of more complex analytical approaches.
A solution that is relatively data agnostic – data will come from many places and exist in many forms. The solution must manage the diversity and provide a flexible way to integrate new data into the analytical framework.

What are Candidate Tools ?

And now to the second question: where are these tools? It is hard to find tools that claim to be “adaptive analytic” tools; or “prescriptive analytics” tools or systems in the sense that I have described them above. I find it interesting that over the last five years, major vendors have subsumed complex analytical capabilities into a more easily understandable components. Specifically, you used to be able to find Microsoft Analytical Services easily on their site. Now it is part of MS SQL Server as SSAS; much the same way that the reporting service is now part of the database offer as SSRS (reporting services). There was a time a few years ago when you had to look really hard on the MS site to find Analytical Services. Of course since then Microsoft has integrated various BI acquisitions into the offer and squared away their marketing communication. Now their positioning is squarely around BI and the database. Both of these concepts are easier to sell at the executive level, than the notion of prescriptive or adaptive analytics.

The emergence of databases and appliances optimized around analytics has simplified the message on the data side. everyone knows they need a database, and now they have one for analytics. At the decision maker level, that is a much easier decision than trying to figure out what kind of analytical approach the organization is going to adopt. People like Teradata have always supported analytics through the integration of SAS and now R as in-database functionality. However, Greenplum, Neteeza and others have incorporated SAS and the open source analytical “R” . In addition, we have seen the emergence (not new but much more talked about it seems) of the columnar database. The one I hear about most is the Sybase IQ product; although there have been a number of posts on the topic on here, here, and here.

My point here is that vendors have too hard a time selling complex analytical solutions, and have subsumed the complex capabilities into the concepts that are easier to package, position and communicate around; namely; database products and Business Intelligence products. The following are product sets that are candidates for the integrated approach. We start with the big players first and work towards that are less obviously candidates.

SAS

The SAS Fraud Framework provides an integration of all the SAS components that required to implement a comprehensive analytics solution around adaptive challenges (all kinds of fraud, compliance, money laundering, etc. as examples). This is a comprehensive suite of capabilities that spans all activities: data capture, ingest, and quality; analytics tools (including algorithm libraries), data visualization and reporting / BI capabilities. Keep in mind that SAS is a company that sells the building blocks, and the Fraud Framework is just that, a framework within which customers can build out capabilities. This is not a simple plug and play implementation process. It takes time and investment and the right team within the organization. The training has improved, and it is now possible to get comprehensive training.

As with any implementation of SAS, this one comes with all the caveats associated with comprehensive enterprise systems that integrate analytics into the fabric of an organization. The Gartner 2013 BI report indicates that SAS “very difficult to implement”. This theme echoes across the product set. Having said that when it comes to integrated analytic of the kind we have been discussing all, of the major vendors suffer from the same implementation challenges – although perhaps for different reasons.

Bottom line however, is that SAS is a company grounded in analytics – the Fraud Framework has everything needed to build out a first class system. However, the corporate culture builds products for hard core quants, and this is reflected in the Gartner comments.

IBM

IBM is another company that has the complete offer. They have invested heavily in the analytics space, and between their ETL tools; the database/ appliance and Big Data capabilities; the statistical product set that builds off SPSS; and, the Cognos BI suite users can build out the capabilities required. Although these products are being integrated into a seamless set of capabilities, they remain somewhat separate and this probably explains some of the implementation challenges reports. Also, the product side of the IBM operation does not necessarily speak with the Global Services side of the house.

I had thought when IBM purchased Systems Research & Development (SRD) in 2005 that they were going to build out capabilities that SRD and Jeff Jonas had developed. Jeff heads up the Entity Analytics group within IBM Research, and his blog is well worth the read. However, the above product set appears to have remained separated from the approaches and intellectual knowledge that came with SRD. This may be on purpose – from a marketing perspective, buy the product set, and then buy IBM services to operationalize the system is not a bad approach.

Regardless, as the saying goes, no one ever got fired for buying IBM” probably still holds true. However, like SAS beware of the implementation! Any one of the above products (SPSS, Cognos, and Infosphere) require attention when implementing. However, when integrating as an operational whole, project leadership needs to ensure that expectations as to the complexity and time frame are communicated.

Other Products

There are many other product sets and I look forward to learning more about them. Once I post this, someone is going to come back and mention “R” and other open source products. There are plenty out there. However, be aware that while the products may be robust, many are not delivered as an integrated package.

With respect to open source tools, it is worth noting that the capabilities inherent in Hadoop – and the related products, lend themselves to adaptive analytics in the sense that operators can consistently re-link and re-index on the fly without having to deal with where and how the data is persisted. This is key in areas like signals intelligence, unstructured data analysis, and even structured data analysis where the notion of semantic equivalence is shifting. This is a juicy topic all by itself and worthy of a whole blog entry.

Notes:

Predictive analytics relies on past observations to predict future observations. In an adaptive environment, the inputs to those predictive models continually change as a result of the outputs using the past observations.

Tags: Adaptive, analytics, best practices, Big Data, BRE BRMS Adaptive, Prescriptive, products

Comments Leave a Comment
Categories Best Practices, BI, Big Data, methodologies
Author analyticaltern

A comparison of programming languages in economics

8 Jul

Interesting comparison of programming language speeds. Given that the big data world seems to be all about Python, I wonder if folks start doing complicated calculations over big data if they will move away from Python? SAS is apparently working on “Accelerators” to work on hadoop nodes which appear to address this same problem. They already have them for Databases and Db appliances.

The above makes sense if you consider that for the most part “big Data” is about folks doing simple calculations in parallel over many data nodes.

The thread of comments below the article are also interesting.

===================================

There is a new NBER working paper with that title, by S. Borağan Aruoba and Jesus Fernandez-Villaverde. Here is the abstract:

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes in a Mac and in a Windows computer and comment on the strength and weakness of each language.

Here are their results:

1. C++ and Fortran are still considerably faster than any other alternative, although one needs to be careful with the choice of compiler.

2. C++ compilers have advanced enough that, contrary to the situation in the 1990s and some folk wisdom, C++ code runs slightly faster (5-7 percent) than Fortran code.

3. Julia, with its just-in-time compiler, delivers outstanding per formance. Execution speed is only between 2.64 and 2.70 times the execution speed of the best C++ compiler.

4. Baseline Python was slow. Using the Pypy implementation, it runs around 44 times slower than in C++. Using the default CPython interpreter, the code runs between 155 and 269 times slower than in C++.

5. However, a relatively small rewriting of the code and the use of Numba (a just-in-time compiler for Python that uses decorators) dramatically improves Python ’s performance: the decorated code runs only between 1.57 and 1.62 times slower than the best C++ executable.

6.Matlab is between 9 to 11 times slower than the best C++ executable. When combined with Mex files, though, the difference is only 1.24 to 1.64 times.

7. R runs between 500 to 700 times slower than C++ . If the code is compiled, the code is between 240 to 340 times slower.

8. Mathematica can deliver excellent speed, about four times slower than C++, but only after a considerable rewriting of the code to take advantage of the peculiarities of the language. The baseline version our algorithm in Mathematica is much slower, even after taking advantage of Mathematica compilation.

There are ungated copies and some discussion here.

Tags: Big Data, database, languages, products, python, speed

Comments Leave a Comment
Categories Big Data, methodologies, Products
Author analyticaltern

Me, My Data & I

29 Jan

One of the ways in which big data is moving into the mainstream is through a growing culture of self-measurement. We see this through the smartphone apps that track the number of daily-diet calories added by lunch at Chipotle. New fitness watches notify us when we sit idle for too long, and track the amount of light vs. deep REM sleep we get each night. When a sudden, out of the ordinary, credit card purchase sparks a stolen-card call from the bank, we are reminded that day-by-day, quietly and invisibly, our individual level data is collected, stored, and even analyzed without our conscience awareness. A recent article in The Guardian makes interesting commentary about this phenomenon. It focuses mainly on healthcare, but I think it has larger implications for individual-level data outside of the sector.

You can find the article here: http://www.theguardian.com/science/political-science/2014/jan/27/science-policy

In a nutshell, the author writes about an experience with error in her smartphone’s running app, which prompts her to question and seek to review the previous data collected about her through the application. Of course, like most apps, this one does not allow her to retrieve the data collected about her by the system. Instead, she and other users will experience the cumulative effect of data collected through new products and updated apps. After several examples that illustrate the potential value of allowing users to access to data collected through such applications, the author concludes that doing so could improve the health of everyone in addition to generating better products.

I think the article raises a valid question about big data that few people discuss. Don’t we miss opportunities for value creation by not allowing individuals to access the data collected about them? A very obvious example in healthcare would be the case of individuals living with type II diabetes – a disease whereby the body’s insulin no longer effectively brings glucose into its cells. Because of this condition, individuals living with type II diabetes depend dramatically on lifestyle and other daily choices for their health. Liberating individual-level data collected through diabetes apps might prove particularly beneficial, both for individuals and for the health system, if people are allowed to use the data to find their own creative ways of improving their health.

However interesting this question may be, I’d like to take it a step further. What potential dark clouds might arise in an otherwise sunny horizon for big data if people aren’t allowed to access data collected about them? One potential problem is that the players who have authority to access and analyze this data may use it to pursue objectives that create value only for themselves. In healthcare, this might manifest as a company using its big data insights to aggressively market to a particular demographic that predominantly suffers from a given disease. What is the downside in this scenario? For starters, many of the recipients of this aggressive marketing may resultantly seek and receive unnecessary treatment that may prove harmful to their health, while those in other markets who actually need the treatment may go without it due to unequal exposure. Another potential downside stemming from this scenario is that it potentially destroys value, especially if payers spend more money on treatments that do not generate value for patients.

In the end, it is good food for thought. Beyond reaping the fruits of product development, consumers might stand to benefit from having access to big data that they help to supply.

Tags: analytics, app, Big Data, consumers, Health, healthcare, individuals, smartphone, Social Media, users

Comments Leave a Comment
Categories Big Data, Healthcare, Metadata, Social Media
Author intellectinject13

The addition of analytical functions to databases

14 Nov

The trend has been for database vendors to integrate analytical functions into their products; thereby moving the analytics closer to the data (versus moving the data to the analytics). Interesting comments in the article below on Curt Monash’s excellent blog.

What was interesting to me, was not the central premise of the story that Curt does not “think [Teradata’s] library of pre-built analytic packages has been a big success”, but rather the BI vendors that are reportedly planning to integrate to those libraries: Tableau, TIBCO Spotfire, and Alteryx. This is interesting as these are the rapid risers in the space, who have risen to prominence on the basis of data visualization and ease of use – not on the basis of their statistical analytics or big data prowess.

Tableau and Spotfire specifically focused on ease of use and visualization as an extension of Excel spreadsheets. They have more recently started to market themselves as being able to deal with “big data” (i.e. being Hadoop buzzword compliant). With the integration to a Teradata stack and presumably integrating front end functionality into some of these back end capabilities, one might expect to see some interesting features. TIBCO actually acquired an analytics company. Are they finally going to integrate the lot on top of a database? I have said it before, and I will say it again, TIBCO has the ESB (Enterprise Service Bus), the visualization tool in Spotfire and the analytical product (Insightful); hooking these all together on a Teradata stack would make a lot of sense – especially since Teradata and TIBCO are both well established in the financial sector. To be fair to TIBCO, they seem to be moving in this direction, but it has been some time since I used the product).

Alteryx is interesting to me in that they have gone after SAS in a big way. I read their white paper and downloaded the free product. They keep harping on the fact that they are simpler to use than SAS, and the white paper is fierce in its criticism of SAS. I gave their tool a quick run through, and came away with two thoughts: 1) the interface while it does not require coding/script as SAS does, cannot really be called simple; and 2) they are not trying to do the same things as SAS. SAS occupies a different space in the BI world than these tools have traditionally occupied. However,…

Do these tools begin to move into the SAS space by integrating onto foundational data capabilities? The reason SAS is less easy to use than the products of these rapidly growing players is that the rapidly growing players have not tackled the really tough analytics problems in the big data space. The moment they start to tackle big data mining problems requiring complex and recursive analytics, will they start to look more like SAS? If you think I am picking on SAS, swap out SAS for the IBM Cognos, SPSS, Netezza, Streams, Big Insights stack, and see how easy that is! Not to mention the price tag that comes with it.

What is certain is that these “new” players in the Statistical and BI spaces will do whatever they can to make advanced capabilities available to a broader audience than traditionally has been the case with SAS or SPSS (IBM). This will have the effect of making analytically enhanced insights more broadly available within organizations – that has to be a good thing.

Article Link and copy below

October 10, 2013

Libraries in Teradata Aster

I recently wrote (emphasis added):

My clients at Teradata Aster probably see things differently, but I don’t think their library of pre-built analytic packages has been a big success. The same goes for other analytic platform vendors who have done similar (generally lesser) things. I believe that this is because such limited libraries don’t do enough of what users want.

The bolded part has been, shall we say, confirmed. As Randy Lea tells it, Teradata Aster sales qualification includes the determination that at least one SQL-MR operator — be relevant to the use case. (“Operator” seems to be the word now, rather than “function”.) Randy agreed that some users prefer hand-coding, but believes a large majority would like to push work to data analysts/business analysts who might have strong SQL skills, but be less adept at general mathematical programming.

This phrasing will all be less accurate after the release of Aster 6, which extends Aster’s capabilities beyond the trinity of SQL, the SQL-MR library, and Aster-supported hand-coding.

Randy also said:

A typical Teradata Aster production customer uses 8-12 of the prebuilt functions (but now they seem to be called operators).
nPath is used in almost every Aster account. (And by now nPath has morphed into a family of about 5 different things.)
The Aster collaborative filtering operator is used in almost every account.
Ditto a/the text operator.
Several business intelligence vendors are partnering for direct access to selected Teradata Aster operators — mentioned were Tableau, TIBCO Spotfire, and Alteryx.
I don’t know whether this is on the strength of a specific operator or not, but Aster is used to help with predictive parts failure applications in multiple industries.

And Randy seemed to agree when I put words in his mouth to the effect that the prebuilt operators save users months of development time.

Meanwhile, Teradata Aster has started a whole new library for relationship analytics.

Tags: Alteryx, Aster, BI, Big Data, Cognos, IBM, Spotfire, SPSS, Tableau, Teradata, TIBCO, visualization

Comments Leave a Comment
Categories BI, Big Data, Products, Visualization
Author analyticaltern

Analytics keeps moving closer to the data!

18 Oct

http://feedproxy.google.com/~r/dbms2/feed/~3/QOuK0EQFRzs/

Note the list of partners – all have a background in visualization and analyst driven capabilities – not big data munging. Where does this leave the companies that are neither visualization, nor database companies? Companies like SAS.

Tags: industry, sas teradata aster, trends

Comments Leave a Comment
Categories Big Data, Data Stores, Industry, Products
Author analyticaltern

Primer on Big Data, Hadoop and “In-memory” Data Clouds

25 Aug

This is a good article. There have been a number of articles recently on the hype of big data, but the fact of the matter is that the technology related to what people are calling “big data” is here to stay, and it is going to change the way complex problems are handled. This article provides an overview. For those looking for products, this has a good set of links.

This is a good companion piece to the articles by Wayne Eckerson referenced in this post

Tags: Big Data, COTS, Open Source

Comments Leave a Comment
Categories Big Data, Products, Uncategorized, Visualization
Author analyticaltern

Databases & Analytics – what database approach works best?

1 Aug

Every once in a while the question comes up as to what is the “right” database for analytics. How do organizations move from their current data environments to environments that are able to support the needs of Big Data and Analytics? It was not too long ago that the predominant answer was a relational database; moreover these were often organized around a highly normalized structure that arranged the fields and tables of a relational database to minimize redundancy and dependency (See also).

These structures to a large extent existed to optimize database efficiencies – or sidestep inefficiencies – in a world that was memory and / or hardware constrained; think 20+ years ago. Many of these constraints no longer exist which has created more choices for practitioners in how to store data. This is especially true of data repositories that built to support analytics as a highly normalized structure is often inefficient and cumbersome for analytics. Matching the data design and management approaches to the need improves performance and reduces operational complexity and with it costs.

The table below lays out some of the approaches and where they might apply. Note, these are not mutually exclusive: one can persist semantic data in a relational database for example. Also, this is not exhaustive by any means. The table below provides a starting point for those that are considering how their data environments should evolve as they seek to move from their legacy environment to one that supports new demands created by the need for analytics.

Data Design Approach	Analytical Activity
Relational. In this context “relational” refers to data stored in rows.	Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks, and resource challenges resulting in delays for data management teams. In the context of analytics the challenges associated with this form of data persistence are discussed in other posts, and a favorite Exploiting Big Data Strategies for Integrating with Hadoop by Wayne Eckerson; Published: June 1, 2012.
Columnar. Columnar data stores might also be considered relational. However, their orientation is around the column versus the row.	Columnar data designs lend themselves to analytical tasking involving large data sets where rapid search and retrieval in large data tables is a priority. See previous post on how columnar databases work. A columnar approach inherently creates vertical partitioning across the datasets stored this way. Columnar DBs allow for retrieval of only a subset of the columns and some columnar DBs allow for processing data in a compressed form. All this minimizes I/O for large retrievals.
Semantic	A semantic organization of data lends itself to analytical tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies are required to organize entities and their relationships to one another: corporate hierarchies/networks; insider trading analysis for example. This approach to organizing data is often represented in the context of the “semantic web” whose organizing constructs are RDF and OWL.
File Based	File based approaches such as those used in Hadoop and SAS systems lend themselves to situations where data must be acquired and landed. However, the required organizational structure or analytical “context” is not yet defined. Data can be landed with minimal processing and made available for analysis in relatively raw form. Under certain circumstances file based approaches can improve performance as they are more easily used in MPP (Massively Parallel Processing or distributed computing) environments. Performance improvements will exist where data size is very large, and functions performed are “embarrassingly parallel“, and can work on platforms that designed around a “shared nothing architecture“; which is the architecture supporting Hadoop. The links referenced within the Relational section above speak to why and when you use Hadoop:see recent posts, and Exploiting Big Data Strategies for Integrating with Hadoop.

There are a few interesting papers on the topic – somewhat dated but still useful:

Curt Monash’s site has a presentation worth looking at title: How to Select an Analytical Database. In general, Curt’s blog DBMS2is well worth tracking.

This deck presented by Mark Madsen at 2011 Strata Conference is both informative and amusing.

Determine the Right Analytic Database: A Survey of New Data Technologies from mark madsen

This is a bioinformatics deck that was interesting. It does not have a date on it. However, good information from a field that has driven developments in approaches to dealing with large complex data problems.

Tags: Big Data, Data Persistence, database

Comments Leave a Comment
Categories Best Practices, Big Data, Data Management, Data Stores
Author analyticaltern

Slowly Sowly the various pieces to build out the Hadoop vision are coming together

27 Jun

This article talks about Hunk (Splunk + Hadoop). It is a good example of how the various pieces are coming together to enable the Hadoop vision to become reality. Where will the mainstream commercial off the shelf folks be? Many of the mainstream data vendors are not moving their code into Hadoop (as Hunk is), but rather moving the extract into their machines.

There are some folks, who believe that the COTS products will win in the end, as they are going to reduce the cost to operate – or the total cost of ownership – to the point that it does not matter how “free” the open source stuff is. On the other hand there are company’s that are going the other way, and creating non traditional support models around open source software. This started with Red Hat. However, Red Hat still has the same approach to licensing – no matter what you use in terms of support, you still pay the license fee – that sounds like Oracle redux to me. I have seen that in a number of other Open Source product vendors as well. The new trend may be to support Open Source tools with a “pay for what you use” type menu. We will see how that goes. In the meantime, who names these products?

Tags: Big Data, COTS, Open Source, products

Comments Leave a Comment
Categories Big Data, Data Management, Products
Author analyticaltern

← Older Entries

Search

analyticaltern

Automating Data Management and Governance through Machine Learning

Some Closing Thoughts :: It’s all about the Ecosystem Maturity!

Classification – the key to releasing data’s value !

Forensic Analytics and the search for “robust” solutions

A comparison of programming languages in economics

The addition of analytical functions to databases

Libraries in Teradata Aster

Analytics keeps moving closer to the data!

Primer on Big Data, Hadoop and “In-memory” Data Clouds

Databases & Analytics – what database approach works best?

Slowly Sowly the various pieces to build out the Hadoop vision are coming together

Recent Posts

Archives

Follow Blog via Email

Interesting Tags

Wayne Erikson

Pages

Search

Some Closing Thoughts :: It’s all about the Ecosystem Maturity!

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Recent Posts

Archives

Follow Blog via Email

Pages