
Evaluating Different Persistence Methods as part of the Planning Process

4 Jul

Every once in a while, I get asked how to select between different types of databases. Generally, the question arises because a product vendor or consultant has recommended evolving towards a Big Data solution. The issue is twofold: companies seek to understand what the next-generation data platform looks like, and how or whether their current environment can evolve. This involves understanding the pros and cons of the current product set and the degree to which it can coexist with newer approaches – Hadoop being the platform people currently talk about.

The following is a list of data persistence approaches that helps at least define the options. This was done some time ago, so I am sure the vendors shown have evolved. However, think of it as a starting point to frame the discussion.

In general, one wants to anchor these discussions in defined criteria that frame them within the context of business drivers. In the following figure, the goal is to show that as data sources and consumers of your data expand to include increasingly complex data structures and “contexts,” there is a need to evolve beyond the traditional relational database (RDBMS) approaches. Different organizations will have different criteria. I provide this as a rubric that has worked before – you will need to create an approach that works for your organization or client.

Evolution of Data Persistence

A number of data persistence approaches support the functional components as defined. These are described below.

Defined Pros / Cons Vendor examples
Relational Databases

(Row Orientation)

Defined: Traditional normalized data models optimized for storing data efficiently.

Pros / Cons: Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when they run queries and joins that are incompatible with the schema design and / or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. This approach is also challenged when dealing with complex semantic data where multiple levels of parent / child relationships exist.

Advantages: This approach is best for transactional data where the relationships between the data, and the use cases driving how data is accessed and used, are stable. Where relational integrity is important and must be enforced consistently, this approach can work well. In a row-based approach, record-locking contention is easier to manage than with other methods.

Disadvantages: Because the relationships between data and relational integrity are enforced through a rigid data model, this approach is inflexible, and changes can be hard to implement.

All major database vendors: IBM – DB2; Oracle; MS SQL and others
Columnar Databases

(Column Oriented)

Defined: Data organized or indexed around columns; can be implemented in SQL or NoSQL environments.

Advantages: Columnar data designs lend themselves to analytical tasks involving large data sets, where rapid search, retrieval and aggregation queries are performed on large tables. A columnar approach inherently creates vertical partitioning across the datasets stored this way. It is efficient and scalable.

Disadvantages: Efficiencies can be offset by the need to join many queries to obtain the desired result.

•Sybase IQ

•InfoBright

•Vertica (HP)

•ParAccel

•MS SQL 2012
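To make the row versus column distinction concrete, here is a minimal Python sketch. The trade records and field names are hypothetical, and real columnar engines add compression, indexing and on-disk layouts that this ignores. It only shows why an aggregation touches less data in a column layout, and why reassembling a full record touches every column.

```python
# Row orientation: each record is stored together (hypothetical sample data).
rows = [
    {"trade_id": 1, "desk": "rates", "notional": 100.0},
    {"trade_id": 2, "desk": "fx", "notional": 250.0},
    {"trade_id": 3, "desk": "rates", "notional": 50.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def fetch_record(columns, position):
    """Rebuild one full record; note that every column has to be read."""
    return {name: values[position] for name, values in columns.items()}


columns = to_columns(rows)

# Aggregating over rows reads whole records just to use one field.
total_from_rows = sum(record["notional"] for record in rows)

# Aggregating over the column layout reads only the one column it needs.
total_from_columns = sum(columns["notional"])
```

Both totals agree; the difference is how much data each layout has to scan to produce them.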

RDF Triple Stores / Databases

Defined: Data stored and organized around RDF triples (actor-action-object or subject-predicate-object); can be implemented in SQL or NoSQL environments.

Advantages: A semantic organization of data lends itself to analytical and knowledge management tasks where understanding complex and evolving relationships is key. This is especially the case where ontologies or SKOS (1) type relationships are required to organize entities and their relationships to one another: corporate hierarchies / networks or insider trading analysis, for example. This approach to organizing data is often represented in the context of the “semantic web,” whose organizing constructs are RDF and OWL. When dealing with complex semantic data where multiple levels of parent / child relationships exist, this approach is more efficient than an RDBMS.

Disadvantages: This approach to storing data is often not as efficient as relational approaches. It can be complicated to write queries to traverse complex networks – however, this is often not much easier in relational databases either.

Note: these can be implemented with XML formatting or in some other form.

Native XML / RDF Databases

•MarkLogic (COTS)

•OpenLink Virtuoso (COTS)

•Stardog (o/s, COTS)

•BaseX (o/s)

•eXist (o/s)

•Sedna (o/s)

XML Enabled Databases

•IBM DB2

•MS SQL

•Oracle

•PostgreSQL

XML-enabled databases deal with XML as a CLOB in a table or organized into tables based on a schema.
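As a rough illustration of the subject-predicate-object idea, the sketch below holds a handful of triples in a plain Python list and answers pattern queries against them. The entities and predicates are hypothetical, and a real triple store would use RDF identifiers and a query language such as SPARQL rather than this toy matcher.

```python
# Each fact is a (subject, predicate, object) triple; all names are hypothetical.
triples = [
    ("Acme Holdings", "owns", "Acme Bank"),
    ("Acme Bank", "owns", "Acme Securities"),
    ("Acme Securities", "employs", "J. Smith"),
    ("J. Smith", "traded", "XYZ stock"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in facts
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Which ownership relationships exist?
ownership = match(triples, predicate="owns")

# What do we know about one entity as a subject?
about_bank = match(triples, subject="Acme Bank")
```

New relationship types can be added as new triples without changing a schema, which is the flexibility the table entry describes.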

Graph Databases

Defined: A database that uses graph structures to store data. See XML / RDF Stores / Databases; graph databases are a variant on this theme.

Advantages: Used primarily to store information on networks. Optimized for iterative joins, often in a recursive process (2).

Disadvantages: Storage challenges, since these are large datasets; results are built through iterative joins, which is very processor intensive.

•ArangoDB

•OrientDB

•Cayley

•Aurelius Titan

•Aurelius Faunus

•Stardog

•Neo4J

•AllegroGraph

(1) SKOS = Simple Knowledge Organization System. Relationships can be expressed as triples; examples are “is part of” and “is similar to”.

(2) Recursion versus iteration
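For readers who want the recursion versus iteration distinction spelled out, here is a small sketch that walks the same network both ways. The graph is a hypothetical ownership network stored as an adjacency dictionary; it is not how any particular graph database stores or traverses data.

```python
# Hypothetical network: each key points to the nodes it links to.
network = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def reachable_recursive(graph, start, seen=None):
    """Collect every node reachable from start by having the function call itself."""
    if seen is None:
        seen = set()
    for neighbor in graph.get(start, []):
        if neighbor not in seen:  # guard against revisiting nodes and cycles
            seen.add(neighbor)
            reachable_recursive(graph, neighbor, seen)
    return seen


def reachable_iterative(graph, start):
    """Collect the same set using an explicit stack and a loop instead."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen
```

Both functions return the same set of nodes. The recursive form is shorter but depends on the call stack, so very deep networks tend to favor the iterative form.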

NoSQL

File based storage – HDFS

Defined: Data structured to expose insights through the use of key-value pairs.

Pros / Cons: This has many of the characteristics of the XML, columnar and graph approaches. In this instance, the data is loaded, and key-value pair (KVP) files are created external to the data. Think of the KVP as an index with a pointer back to the source data. This approach is generally associated with the Hadoop / MapReduce capabilities, and the definition here assumes that KVP files are queried using the capabilities available in the Hadoop ecosystem.

Advantages: Flexibility; MPP capabilities; speed; schema-less; scalable; great at creating views of data and performing simple calculations across Big Data; significant open source community, especially through the Apache Software Foundation. A shared-nothing architecture optimizes the read process. However, it creates challenges in meeting ACID (1) requirements. File-based storage systems adhere to the BASE (2) requirements.

Disadvantages: A shared-nothing architecture creates complexity in uses where sequencing of transactions or writing data is important, especially when multiple nodes are involved; complex metadata requirements; few tool “packages” available to support production environments; relatively immature product set.

Document Store

•Mongo DB

•Couch DB

Column Store

•Cassandra

•HBase

•Accumulo

Key Value Pair

•Redis

•Riak

(1) ACID = Atomicity, Consistency, Isolation, Durability. Used for transaction processing systems.

(2) BASE = Basically Available, Soft state, Eventual consistency. Used for distributed parallel processing systems where maintaining complete consistency is often prohibitively expensive.
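The idea of a KVP acting as an index with a pointer back to the source data can be sketched in a few lines. This is a single-process illustration in the spirit of a map step followed by grouping, not Hadoop code; the record layout, the choice of customer as the key, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw source data, loaded as-is with no schema applied up front.
source_lines = [
    "1001,acme,widget",
    "1002,globex,gadget",
    "1003,acme,gizmo",
]


def map_record(position, line):
    """Map step: emit (key, pointer) pairs; here the key is the customer field."""
    customer = line.split(",")[1]
    yield customer, position


def build_index(lines):
    """Group the emitted pairs by key, producing key -> list of pointers."""
    index = defaultdict(list)
    for position, line in enumerate(lines):
        for key, pointer in map_record(position, line):
            index[key].append(pointer)
    return dict(index)


def lookup(index, lines, key):
    """Follow the pointers back to the untouched source records."""
    return [lines[pointer] for pointer in index.get(key, [])]


customer_index = build_index(source_lines)
acme_records = lookup(customer_index, source_lines, "acme")
```

The index lives outside the data, so a different question (say, by product) only needs a different map function, not a reload of the source.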

In-Memory Approaches

Defined: Data approaches where the data is loaded into active memory to improve efficiency.

Pros / Cons: Note that multiple persistence approaches can be implemented in memory.

Advantages: Speed; flexibility – ability to virtualize views and calculated / derived tables; think of Datamarts in the traditional BI context

Disadvantages: Hardware, cost

•SAP HANA

•SAS High Performance Analytics

•VoltDB
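As a rough illustration of holding data in memory and exposing a derived view over it, the sketch below uses SQLite's in-memory mode from the Python standard library. SQLite is not one of the products listed above; it is swapped in only because it is easy to run, and the table and column names are hypothetical.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM for the life of the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# A view is a derived table computed on demand, with no copy or load step.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)


def regional_totals(connection):
    """Read the derived view into a plain dictionary."""
    return dict(connection.execute("SELECT region, total FROM sales_by_region"))


before = regional_totals(conn)

# New data shows up in the view immediately because nothing was materialized.
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("west", 25.0))
after = regional_totals(conn)
```

The view behaves like a small datamart that is always current, which is the appeal described under Advantages; the trade-off is that everything has to fit in memory.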

The classes of tools below are presented as they provide alternatives for capabilities that are likely to be required. Many of the capabilities are resident in some of the tool sets already discussed.
Data Virtualization

Defined: The ability to produce tables or views without going through an ETL process.

Pros / Cons: Data virtualization is a capability built into other products. Any in-memory product inherently virtualizes data. Likewise, a number of the enterprise BI tools allow data, generally in the form of “cubes,” to be virtualized. Denodo Technologies is the major pure-play vendor. The other vendors generally provide products that are part of larger suites of tools.

•Composite Software (Cisco)

•Denodo Technologies

•Informatica

•IBM

•MS

•SAP

•Oracle

Search Engines

Defined: Data management components that are used to search structured and unstructured data.

Pros / Cons: Search engines and appliances perform functions as simple as indexing data and as complex as Natural Language Processing (NLP) and entity extraction. They are referenced here because the functionality can be implemented as a standalone capability and may be considered part of the overall capability stack.

•Google Search Appliance

•Elasticsearch

Hybrid Approaches

Defined: Data products that implement both SQL and NoSQL approaches.

Pros / Cons: These are traditional SQL database approaches that have been partnered with one or more of the approaches defined above. Teradata acquired Aster to create a “bolt-on” to a traditional SQL database; IBM has Db2 / Netezza / Big Insights. SAS uses a file-based storage system and has created “Access Modules” that work through Apache Hive to apply analytics within either an HDFS environment or the SAS environment.

Another hybrid approach is exemplified by Cassandra, which incorporates elements of a data model within a distributed NoSQL system (Cassandra uses its own storage engine rather than HDFS).

One also sees organizations implementing HDFS / RDBMS solutions for different functions: for example, acquiring, landing and staging data using an HDFS approach and then, once requirements and the business use are known, creating structured data models to facilitate and control delivery.

Advantages: Integrated solutions; ability to leverage legacy; more developed toolkits to support production operations. Compared to open source, production ready solutions require less configuration and code development.

Disadvantages: Tend to be costly; architecture tends to be inflexible – all or nothing mindset.

•Teradata

•EMC

•SAS

•IBM

•Cassandra (Apache)


Formalizing and optimizing your risk architecture within the BCBS context

1 Jul

This article was interesting to me for two reasons: 1) It formalized a data view of risk management within the financial community around the context of BCBS (and for BCBS 239 related to Risk Data Aggregation and Reporting); and, 2) it provided an interesting perspective on governance / data quality and associated metrics.

This paragraph sums it up:

“What is actually happening in practice is that each major institution’s regional banks are lobbying/negotiating with their local/regional regulators to agree on an initial form of compliance – typically as some form of MS Word or PowerPoint presentation to prove that that they have an understanding of their risk data architecture. One organization might write up 100 page tome to show its understanding – and another might write up 10. The “ask” is vague and the interpretation is subjective. Just what is adequate?”

The perspective on governance that the article proposes is a way to “systematically compare different architectures” with a set of metrics that is understandable, obtainable, and actionable.

Have a read – let me know your thoughts. The article also provides a nice summary of the BCBS requirements.

Old School vs. New School – It’s Both!

24 Oct

Excellent article by Wayne Eckerson (most of his are). We give data warehouses a bad name because they have been implemented in a way that does not meet the business’s needs – certainly not from an analytical perspective. HOWEVER, the business reasons that they exist remain, and this is Wayne’s point. I have been watching the shouting match between Inmon and Kimball. I think they are both wrong – the answer is not as simple as they make it out to be – our world will be hybrid SQL/RDBMS and NoSQL, and everything will need to play nice together! Those are my words of wisdom on a Friday 🙂

A comparison of programming languages in economics

8 Jul

Interesting comparison of programming language speeds. Given that the big data world seems to be all about Python, I wonder whether folks will move away from Python once they start doing complicated calculations over big data. SAS is apparently working on “Accelerators” to run on Hadoop nodes, which appear to address this same problem. They already have them for databases and database appliances.

The above makes sense if you consider that, for the most part, “Big Data” is about folks doing simple calculations in parallel over many data nodes.

The thread of comments below the article is also interesting.

===================================

There is a new NBER working paper with that title, by S. Borağan Aruoba and Jesus Fernandez-Villaverde. Here is the abstract:

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes in a Mac and in a Windows computer and comment on the strength and weakness of each language.

Here are their results:

1. C++ and Fortran are still considerably faster than any other alternative, although one needs to be careful with the choice of compiler.

2. C++ compilers have advanced enough that, contrary to the situation in the 1990s and some folk wisdom, C++ code runs slightly faster (5-7 percent) than Fortran code.

3. Julia, with its just-in-time compiler, delivers outstanding performance. Execution speed is only between 2.64 and 2.70 times the execution speed of the best C++ compiler.

4. Baseline Python was slow. Using the Pypy implementation, it runs around 44 times slower than in C++. Using the default CPython interpreter, the code runs between 155 and 269 times slower than in C++.

5. However, a relatively small rewriting of the code and the use of Numba (a just-in-time compiler for Python that uses decorators) dramatically improves Python’s performance: the decorated code runs only between 1.57 and 1.62 times slower than the best C++ executable.

6. Matlab is between 9 to 11 times slower than the best C++ executable. When combined with Mex files, though, the difference is only 1.24 to 1.64 times.

7. R runs between 500 to 700 times slower than C++. If the code is compiled, the code is between 240 to 340 times slower.

8. Mathematica can deliver excellent speed, about four times slower than C++, but only after a considerable rewriting of the code to take advantage of the peculiarities of the language. The baseline version of our algorithm in Mathematica is much slower, even after taking advantage of Mathematica compilation.

There are ungated copies and some discussion here.
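To give a feel for what “value function iteration with grid search” involves, here is a minimal pure-Python sketch. It is not the paper's code: it assumes a simplified deterministic growth model with log utility and full depreciation rather than the stochastic model in the abstract, and the parameter values, grid bounds and function name are my own choices. That special case is convenient because it has a known closed-form savings rule, k' = alpha * beta * k^alpha, to compare against.

```python
import math


def solve_growth_model(alpha=0.3, beta=0.95, n_grid=100, tol=1e-6, max_iter=5000):
    """Value function iteration with grid search for a simplified growth model.

    Assumptions (not from the paper): no shocks, log utility, full depreciation.
    Returns the capital grid, the value function, and next-period capital choices.
    """
    # Centre the capital grid on the model's steady state.
    k_star = (alpha * beta) ** (1.0 / (1.0 - alpha))
    low, high = 0.5 * k_star, 1.5 * k_star
    grid = [low + (high - low) * i / (n_grid - 1) for i in range(n_grid)]

    # Precompute utility of consumption for every (capital today, capital tomorrow).
    utility = []
    for k in grid:
        output = k ** alpha
        row = []
        for k_next in grid:
            consumption = output - k_next
            row.append(math.log(consumption) if consumption > 0 else -math.inf)
        utility.append(row)

    value = [0.0] * n_grid
    choice = [0] * n_grid
    for _ in range(max_iter):
        new_value = []
        for i in range(n_grid):
            # Grid search: try every candidate for tomorrow's capital.
            best = max(range(n_grid), key=lambda j: utility[i][j] + beta * value[j])
            choice[i] = best
            new_value.append(utility[i][best] + beta * value[best])
        gap = max(abs(new - old) for new, old in zip(new_value, value))
        value = new_value
        if gap < tol:  # stop once successive value functions barely change
            break

    policy = [grid[j] for j in choice]
    return grid, value, policy


grid, value, policy = solve_growth_model()
closed_form = [0.3 * 0.95 * k ** 0.3 for k in grid]
largest_error = max(abs(p - c) for p, c in zip(policy, closed_form))
```

The triple loop (iterations, capital today, capital tomorrow) is where interpreted languages spend their time, which is consistent with the large gaps between baseline Python or R and the compiled languages reported above.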

 

 

Self Service BI

29 May

Good article on Self Serve BI. The term has been around a while, but never seems to get old.

Interesting thought process to identify analytical approaches

29 Jan

Courtesy of a colleague in the medical data management world – check out this graphic. It is missing a few approaches, but lays out the thought process well.

Machine Learning - Cheatsheet

The Booz Allen Field Guide to Data Science has a similar linkage that is useful. That book can be downloaded here.

While I am at it, I found this good book, Managing Research Data by Graham Pryor. I continue to be surprised at the approaches taken by “traditional” data management folks to feed the analytical processes. The old-school way of dealing with analytics data did not work well, which has created some of the organizational workarounds that exist in companies. This only gets worse when dealing with large amounts of data, and with data that must work across systems / sources.

Big Data and Marketing – Geoffrey Moore

15 Nov

Geoffrey Moore does a good job of explaining big data and marketing – always a cogent explanation of things.

The addition of analytical functions to databases

14 Nov

The trend has been for database vendors to integrate analytical functions into their products; thereby moving the analytics closer to the data (versus moving the data to the analytics). Interesting comments in the article below on Curt Monash’s excellent blog.

What was interesting to me was not the central premise of the story, that Curt does not “think [Teradata’s] library of pre-built analytic packages has been a big success”, but rather the BI vendors that are reportedly planning to integrate with those libraries: Tableau, TIBCO Spotfire, and Alteryx. This is interesting because these are the rapid risers in the space, which have risen to prominence on the basis of data visualization and ease of use – not on the basis of their statistical analytics or big data prowess.

Tableau and Spotfire specifically focused on ease of use and visualization as an extension of Excel spreadsheets. They have more recently started to market themselves as being able to deal with “big data” (i.e. being Hadoop buzzword compliant). With the integration to a Teradata stack, and presumably the integration of front-end functionality with some of these back-end capabilities, one might expect to see some interesting features. TIBCO actually acquired an analytics company. Are they finally going to integrate the lot on top of a database? I have said it before, and I will say it again: TIBCO has the ESB (Enterprise Service Bus), the visualization tool in Spotfire and the analytical product (Insightful); hooking these all together on a Teradata stack would make a lot of sense – especially since Teradata and TIBCO are both well established in the financial sector. (To be fair to TIBCO, they seem to be moving in this direction, but it has been some time since I used the product.)

Alteryx is interesting to me in that they have gone after SAS in a big way. I read their white paper and downloaded the free product. They keep harping on the fact that they are simpler to use than SAS, and the white paper is fierce in its criticism of SAS. I gave their tool a quick run-through and came away with two thoughts: 1) the interface, while it does not require coding/scripting as SAS does, cannot really be called simple; and 2) they are not trying to do the same things as SAS. SAS occupies a different space in the BI world than these tools have traditionally occupied. However,…

Do these tools begin to move into the SAS space by integrating onto foundational data capabilities? The reason SAS is less easy to use than the products of these rapidly growing players is that the rapidly growing players have not tackled the really tough analytics problems in the big data space. The moment they start to tackle big data mining problems requiring complex and recursive analytics, will they start to look more like SAS? If you think I am picking on SAS, swap out SAS for the IBM Cognos, SPSS, Netezza, Streams, Big Insights stack, and see how easy that is! Not to mention the price tag that comes with it.

What is certain is that these “new” players in the Statistical and BI spaces will do whatever they can to make advanced capabilities available to a broader audience than traditionally has been the case with SAS or SPSS (IBM). This will have the effect of making analytically enhanced insights more broadly available within organizations – that has to be a good thing.

Article Link and copy below

October 10, 2013

Libraries in Teradata Aster

I recently wrote (emphasis added):

My clients at Teradata Aster probably see things differently, but I don’t think their library of pre-built analytic packages has been a big success. The same goes for other analytic platform vendors who have done similar (generally lesser) things. I believe that this is because such limited libraries don’t do enough of what users want.

The bolded part has been, shall we say, confirmed. As Randy Lea tells it, Teradata Aster sales qualification includes the determination that at least one SQL-MR operator — be relevant to the use case. (“Operator” seems to be the word now, rather than “function”.) Randy agreed that some users prefer hand-coding, but believes a large majority would like to push work to data analysts/business analysts who might have strong SQL skills, but be less adept at general mathematical programming.

This phrasing will all be less accurate after the release of Aster 6, which extends Aster’s capabilities beyond the trinity of SQL, the SQL-MR library, and Aster-supported hand-coding.

Randy also said:

  • A typical Teradata Aster production customer uses 8-12 of the prebuilt functions (but now they seem to be called operators).
  • nPath is used in almost every Aster account. (And by now nPath has morphed into a family of about 5 different things.)
  • The Aster collaborative filtering operator is used in almost every account.
  • Ditto a/the text operator.
  • Several business intelligence vendors are partnering for direct access to selected Teradata Aster operators — mentioned were Tableau, TIBCO Spotfire, and Alteryx.
  • I don’t know whether this is on the strength of a specific operator or not, but Aster is used to help with predictive parts failure applications in multiple industries.

And Randy seemed to agree when I put words in his mouth to the effect that the prebuilt operators save users months of development time.

Meanwhile, Teradata Aster has started a whole new library for relationship analytics.
