Evaluating Different Persistence Methods as part of the Planning Process

4 Jul

Every once in a while, I get asked about how to select between different types of databases. Generally, this comment is as a result of a product vendor or consultant making a recommendation to evolve towards a Big Data solution. The issue is twofold in that companies seek to understand what the next generation data platform looks like; AND, how or if their current environment can evolve. This involves understanding the pros and cons of the current product set and to what degree they can exist with newer approaches – Hadoop being the current platform people talk about.

The following is a list of data persistence approaches that helps at least define the options. This was done some time ago, so I am sure the vendors shown have evolved. However, think of it as a starting point to frame the discussion.

In general, one wants to anchor these discussions in some defined criteria that can help frame the discussion within the context of  business drivers. In the following figure, the goal is to show that as data sources and consumers of your data expand to include increasingly complex data structures and “contexts,” there is a need to evolve approaches beyond the traditional relational database (RDBMS) approaches. Different organizations will have different criteria. I provide this as a rubric that has worked before – you will need to create an approach that works for your organization or client.

Evolution of Data Persistence

A number of data persistence approaches  support the functional components as defined. These are described below.

Defined Pros / Cons Vendor examples
Relational Databases

(Row Orientation)

Traditional normalized data models optimized for efficiently storing data Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks, and resource challenges resulting in delays for data management teams. This approach is challenged when dealing with complex semantic data where multiple levels of parent / child relationships exist.

Advantages: This approach is best for transactional data where the relationships between the data and the use cases driving how data is accessed and used are stable. In uses where relational integrity is important and must be enforced in a consistent manner, this approach can work well. In a row based approach, contention on record locking are easier to manage than other methods.

Disadvantages: As the relationships between data and relational integrity are enforced through the application of a rigid data model, this approach is inflexible, and changes can be hard to  implement.

All major database vendors: IBM – DB2; Oracle; MS SQL and others
Columnar Databases

(Column Oriented)

Data organized  or indexed around columns; can be implemented in SQL or a NoSQL environments. Advantages: Columnar data designs lend themselves to analytical tasking involving large data sets where rapid search, retrieval and aggregation type queries are performed on large data tables. A columnar approach inherently creates vertical partitioning across the datasets stored this way. It is efficient and scalable.

Disadvantages: efficiencies can be offset by  the need to join many queries to obtain the desired result.

•Sybase IQ

•InfoBright

•Vertica (HP)

•Par Accel

•MS SQL 2012

Defined Pros / Cons Vendor examples
RDF Triple Stores / Databases Data stored organized around RDF triples (Actor-action-object OR Subject-predicate-Object); can be implemented in SQL or a NoSQL environments. Advantages: A semantic organization of data lends itself to analytical and knowledge management tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies or SKOS (1) type relationships are required to organize entities and their relationships to one another: corporate hierarchies/networks; insider trading analysis for example. This approach to organizing data is often represented in the context of the “semantic web” whose organizing constructs are RDF and OWL. when dealing with complex semantic data where multiple levels of parent / child relationships exist, this approach is more efficient that RDBMS

Disadvantages: This approach to storing data is often not as efficient as relational approaches. It can be complicated to write queries to traverse complex networks – however, this is often not much easier in relational databases either.

Note: these can be implemented with XML  formatting or in some other form.

Native XML  / RDF Databases

•Marklogic (COTS)

•OpenLink Virtuoso (COTS)

•Stardog (o/s, COTS)

•BaseX  (o/s)

•eXist  (o/s)

•Sedna (o/s)

XML Enabled Databases

•IBM DB2

•MS SQL

•Oracle

•PostgrSQL

XML enabled databases deal with XML as a CLOB in a table or organized into tables based on a schema

Graph Databases A database that uses graph structures to store data. See XML / RDF Stores / Databases. Graph Databases are a variant on this theme.

Advantages:  Used primarily to store information on networks. Optimized for iterative joins; often in a recursive process (2)..

Disadvantages: Storage challenges – these are large datasets; builds through iterative joins – very processor intensive.

•ArangoDB

•OrientDB

•Cayley

•Aurelius Titan

•Aurelius Faunus

•Stardog

•Neo4J

•AllegroGraph

(1) SKOS = Simple Knowledge Organization Structure. Relationships can be expressed as triples; examples are “is part of”; “is similar to”

(2) Recursion versus iteration

Defined Pros / Cons Vendor Examples
NoSQL

File based storage – HDFS

Data structured to expose  insights through the use of “key pairs” This has many of the characteristics of the XML, Columnar and Graph approaches. In this instance, the data is loaded, and key value pair (KVP) files created external to the data. Think of the KVP as an index with a pointer back to the source data. This approach is generally associated with the Hadoop / MapReduce  capabilities, and the definition here assumes that KVP files are queried using the capabilities available in the Hadoop ecosystem

Advantages: flexibility; MPP capabilities; speed; schema-less; scalable; Great at creating views of data; and performing simple calculations across Big Data; significant open source community – especially through the Apache Foundation. Shared nothing architecture optimizes the read process. However, it creates challenges in meeting ACID (1) requirements. File based storage systems adhere to the BASE (2) requirements

Disadvantages: Share nothing architecture creates complexity in uses where sequencing of transactions or writing data is important – especially when multiple nodes are involved; complex metadata requirement; few tool “packages” available to support production environments; relatively immature product set.

Document Store

•Mongo DB

•Couch DB

Column Store

•Cassandra

•Hbase

•Accumulo

Key Value Pair

•Redis

•Riak

(1) ACID = Atomicity; Consistent; Isolated; Durable. Used for Transaction processing systems.

(2) BASE = Basic Availability, Soft State; Eventual Consistency. Used for distributed parallel processing systems where maintaining complete consistency is often prohibitively expensive

Defined Pros / Cons Vendor examples
In-Memory Approaches Data approaches where the data is loaded into active memory to improve efficiency Note that multiple persistence approaches can be implemented in memory

Advantages: Speed; flexibility – ability to virtualize views and calculated / derived tables; think of Datamarts in the traditional BI context

Disadvantages: Hardware, cost

•SAP HANA

•SAS High Performance Analytics

•VoltDB

The classes of tools below are presented as they provide alternatives for capabilities that are likely to be required. Many of the capabilities are resident in some of the tool sets already discussed.
Data Virtualization The ability to produce tables or views without going through an ETL process Data  virtualization is a capability built into other products. Any In- Memory product inherently virtualizes data. Likewise a number of the Enterprise BI tools allow data – generally in the form of “cubes” to be virtualized. Denodo Technologies is the major pure play vendor. The others vendors generally provide products that are part of larger suites of tools. •Composite Software (Cisco)

•Denodo Technologies

•Informatica

•IBM

•MS

•SAP

•Oracle

Search Engines Data management components that are used to search structured and unstructured data Search engines and appliances perform functions as simple as indexing data, and as complex as Natural Language Processing (NLP) and entity extraction. They are referenced here as the functionality can be implemented as stand alone capability and may be considered as part of the overall capability stack. •Google Search Appliance

•Elastic Search

Defined Pros / Cons Vendor examples
Hybrid Approaches Data products that implement both SQL and NoSQL approaches These are traditional SQL database approaches that have been partnered with one or more of the approaches defined above. Teradata acquired Aster to create a “bolt on” to a traditional SQL Db; IBM has Db2/Netezza/Big Insights. SAS uses a file based storage system and has created “Access Modules” that work though Apache HIVE to apply analytics within either an HDFS environment, or the SAS environment.

Another hybrid approach is exemplified by Cassandra that incorporates elements of a data model within a HDFS based system.

One also sees organizations implementing HDFS / RDBMS solutions for different functions. For example acquiring, landing and staging data using an HDFS approach, and then once requirements and the business use is known creating structured data models to facilitate and control delivery

Advantages: Integrated solutions; ability to leverage legacy; more developed toolkits to support production operations. Compared to open source, production ready solutions require less configuration and code development.

Disadvantages: Tend to be costly; architecture tends to be inflexible – all or nothing mindset.

•Teradata

•EMC

•SAS

•IBM

•Cassandra (Apache)

Advertisements

Formalizing and optimizing your risk architecture within the BCBS context

1 Jul

This article was interesting to me for two reasons: 1) It formalized a data view of risk management within the financial community around the context of BCBS (and for BCBS 239 related to Risk Data Aggregation and Reporting); and, 2) it provided an interesting perspective on governance / data quality and associated metrics.

This paragraph sums it up:

“What is actually happening in practice is that each major institution’s regional banks are lobbying/negotiating with their local/regional regulators to agree on an initial form of compliance – typically as some form of MS Word or PowerPoint presentation to prove that that they have an understanding of their risk data architecture. One organization might write up 100 page tome to show its understanding – and another might write up 10. The “ask” is vague and the interpretation is subjective. Just what is adequate?”

The perspective on governance that  the article proposed is a way to “systematically compare different architectures” with a set of metrics that was understandable, obtainable, and actionable.

Have a read – let me know your thoughts. The article also provides a nice summary of the BCBS requirements.

Old School vs. new school – Its Both!

24 Oct

Excellent article by Wayne Eckerson (most of his are) . We give Data Warehouses a bad name because they have been implemented in a way that does not meet the businesses needs – certainly not from an analytical perspective. HOWEVER, the business reasons that they exist remain, and this is Wayne’s point. I have been watching the shouting match between Inmon and Kimball.  I think they are both wrong – the answer is not as simple as they make it out to be – our world will be hybrid SQL/RDBMS and NoSQL and everything will need to play nice together! Those are my words  of wisdom on a Friday 🙂

Interesting thought process to identify analytical approaches

8 Jul

analyticaltern

Courtesy of a colleague in the medical data management world – check out this graphic. It is missing a few approaches, but lays out the thought process well.

Machine Learning - Cheastsheet

The Booz Allen Field Guide to Data Science has a similar linkage that is useful. That book can be downloaded here

While I am at at, I found this good book Managing Research Data by Graham-Pryor that focuses on managing research data. I continue to be surprised at the approaches taken by “traditional” data management folks to feed the analytical processes. The old school way of dealing with analytics data did not work well which has created some of the organizational work arounds that exist in companies. This only gets worse when dealing with large amounts of data, and data that must work across systems / sources.

View original post

A comparison of programming languages in economics

8 Jul

Interesting comparison of programming language speeds. Given that the big data world seems to be all about Python, I wonder if folks start doing complicated calculations over big data if they will move away from Python? SAS is apparently working on “Accelerators” to work on hadoop nodes which appear to address this same problem. They already have them for Databases and Db appliances.

The above makes sense if you consider that for the most part “big Data” is about folks doing simple calculations in parallel  over many data nodes.

The thread of comments below the article are also interesting.

===================================

There is a new NBER working paper with that title, by S. Borağan Aruoba and Jesus Fernandez-Villaverde. Here is the abstract:

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes in a Mac and in a Windows computer and comment on the strength and weakness of each language.

Here are their results:

1. C++ and Fortran are still considerably faster than any other alternative, although one needs to be careful with the choice of compiler.

2. C++ compilers have advanced enough that, contrary to the situation in the 1990s and some folk wisdom, C++ code runs slightly faster (5-7 percent) than Fortran code.

3. Julia, with its just-in-time compiler, delivers outstanding per formance. Execution speed is only between 2.64 and 2.70 times the execution speed of the best C++ compiler.

4. Baseline Python was slow. Using the Pypy implementation, it runs around 44 times slower than in C++. Using the default CPython interpreter, the code runs between 155 and 269 times slower than in C++.

5. However, a relatively small rewriting of the code and the use of Numba (a just-in-time compiler for Python that uses decorators) dramatically improves Python ’s performance: the decorated code runs only between 1.57 and 1.62 times slower than the best C++ executable.

6.Matlab is between 9 to 11 times slower than the best C++ executable. When combined with Mex files, though, the difference is only 1.24 to 1.64 times.

7. R runs between 500 to 700 times slower than C++ . If the code is compiled, the code is between 240 to 340 times slower.

8. Mathematica can deliver excellent speed, about four times slower than C++, but only after a considerable rewriting of the code to take advantage of the peculiarities of the language. The baseline version our algorithm in Mathematica is much slower, even after taking advantage of Mathematica compilation.

There are ungated copies and some discussion here.

 

 

The more things change, the more they stay the same…

16 Jun

Very amusing article – totally agree that the addition of a CDO if too broadly defined begins to confuse things. Is the CDO part of an integrated strategy to move things ahead, or a tactical band-aid response to an organizational problem?

 

Self Service BI

29 May

Good article on Self Serve BI. The term has been around a while, but never seems to get old.

Interesting thought process to identify analytical approaches

29 Jan

Courtesy of a colleague in the medical data management world – check out this graphic. It is missing a few approaches, but lays out the thought process well.

Machine Learning - Cheastsheet

The Booz Allen Field Guide to Data Science has a similar linkage that is useful. That book can be downloaded here

While I am at at, I found this good book Managing Research Data by Graham-Pryor that focuses on managing research data. I continue to be surprised at the approaches taken by “traditional” data management folks to feed the analytical processes. The old school way of dealing with analytics data did not work well which has created some of the organizational work arounds that exist in companies. This only gets worse when dealing with large amounts of data, and data that must work across systems / sources.

%d bloggers like this: