Tag Archives: Open Source

Evaluating Different Persistence Methods as part of the Planning Process

4 Jul

Every once in a while, I get asked how to select between different types of databases. Generally, the question comes up because a product vendor or consultant has recommended evolving towards a Big Data solution. The issue is twofold: companies want to understand what the next generation data platform looks like, AND how (or whether) their current environment can evolve towards it. That means understanding the pros and cons of the current product set and to what degree it can coexist with newer approaches – Hadoop being the platform people currently talk about.

The following is a list of data persistence approaches that at least helps define the options. It was put together some time ago, so I am sure the vendors shown have evolved; think of it as a starting point to frame the discussion.

In general, one wants to anchor these discussions in defined criteria that place them in the context of business drivers. In the following figure, the goal is to show that as the sources and consumers of your data expand to include increasingly complex data structures and “contexts,” there is a need to evolve beyond traditional relational database (RDBMS) approaches. Different organizations will have different criteria. I provide this as a rubric that has worked before – you will need to create an approach that works for your organization or client.

Evolution of Data Persistence

A number of data persistence approaches support the functional components as defined. Each is described below in terms of how it is defined, its pros and cons, and example vendors.
Relational Databases

(Row Oriented)

Defined: Traditional normalized data models optimized for efficiently storing data.

Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and/or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. The approach is also challenged when dealing with complex semantic data where multiple levels of parent / child relationships exist.

Advantages: This approach is best for transactional data where the relationships between the data, and the use cases driving how the data is accessed and used, are stable. In uses where referential integrity is important and must be enforced in a consistent manner, this approach can work well. In a row based approach, contention on record locking is easier to manage than with other methods (a small sketch follows the vendor list below).

Disadvantages: Because the relationships between data and referential integrity are enforced through a rigid data model, this approach is inflexible, and changes can be hard to implement.

All major database vendors: IBM – DB2; Oracle; MS SQL and others
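To make the transactional use case concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for any RDBMS. The table and column names are invented for illustration, and the in-memory database is used only to keep the sketch self-contained; the point is the normalized parent/child design with enforced referential integrity and an all-or-nothing transaction.

```python
import sqlite3

# Any relational engine would do; sqlite3 ships with Python and is enough for a sketch.
conn = sqlite3.connect(":memory:")          # in-memory only so the example is self-contained
conn.execute("PRAGMA foreign_keys = ON")    # enforce referential integrity

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
""")

# A transaction: both rows are written, or neither is.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Acme Corp')")
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 250.00)")
except sqlite3.IntegrityError as err:
    print("rejected by the schema:", err)

print(conn.execute(
    "SELECT name, amount FROM customer JOIN orders USING (customer_id)").fetchall())
```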
Columnar Databases

(Column Oriented)

Defined: Data organized or indexed around columns; can be implemented in SQL or NoSQL environments.

Advantages: Columnar data designs lend themselves to analytical tasking involving large data sets, where rapid search, retrieval and aggregation type queries are performed on large data tables. A columnar approach inherently creates vertical partitioning across the datasets stored this way. It is efficient and scalable (see the sketch after the vendor list).

Disadvantages: Efficiencies can be offset by the need to combine the results of many queries to obtain the desired result.

•Sybase IQ

•InfoBright

•Vertica (HP)

•ParAccel

•MS SQL 2012
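A toy illustration of the column-oriented idea in plain Python (no particular vendor implied; the data is made up): the same records stored row-wise and column-wise, with an aggregation that only has to touch a single column in the columnar layout.

```python
# Row orientation: each record kept together; an aggregate must walk every record.
rows = [
    {"region": "East", "product": "A", "sales": 120.0},
    {"region": "West", "product": "B", "sales": 75.0},
    {"region": "East", "product": "B", "sales": 60.0},
]

# Column orientation: each column stored (and compressed / indexed) as its own array.
columns = {
    "region":  ["East", "West", "East"],
    "product": ["A", "B", "B"],
    "sales":   [120.0, 75.0, 60.0],
}

# Aggregation in the columnar layout scans one contiguous column...
total_sales = sum(columns["sales"])

# ...whereas the row layout walks every full record to get at one field.
total_sales_rowwise = sum(r["sales"] for r in rows)

assert total_sales == total_sales_rowwise == 255.0
print(total_sales)
```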

RDF Triple Stores / Databases

Defined: Data stored and organized around RDF triples (actor-action-object, or subject-predicate-object); can be implemented in SQL or NoSQL environments.

Advantages: A semantic organization of data lends itself to analytical and knowledge management tasks where understanding complex and evolving relationships is key. This is especially the case where ontologies or SKOS (1) type relationships are required to organize entities and their relationships to one another: corporate hierarchies / networks, or insider trading analysis, for example. This approach to organizing data is often represented in the context of the “semantic web,” whose organizing constructs are RDF and OWL. When dealing with complex semantic data where multiple levels of parent / child relationships exist, this approach is more efficient than an RDBMS (a small sketch follows this entry).

Disadvantages: This approach to storing data is often not as efficient as relational approaches. It can be complicated to write queries to traverse complex networks – however, this is often not much easier in relational databases either.

Note: these can be implemented with XML formatting or in some other form.

Native XML / RDF Databases

•Marklogic (COTS)

•OpenLink Virtuoso (COTS)

•Stardog (o/s, COTS)

•BaseX (o/s)

•eXist (o/s)

•Sedna (o/s)

XML Enabled Databases

•IBM DB2

•MS SQL

•Oracle

•PostgreSQL

XML enabled databases handle XML either as a CLOB in a table or organized into tables based on a schema.
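As a small illustration of the triple idea, here is a hedged sketch using the open source rdflib package for Python (assumes `pip install rdflib`; any triple store would do, and the example namespace and reporting relationship are invented). It stores subject-predicate-object statements about a corporate hierarchy and uses a SPARQL property path to walk the parent/child chain – the kind of relationship traversal described above.

```python
from rdflib import Graph, Namespace, Literal  # assumes the rdflib package is installed

EX = Namespace("http://example.org/")  # hypothetical namespace for the sketch
g = Graph()

# Subject - predicate - object statements (actor - action - object).
g.add((EX.alice, EX.reportsTo, EX.bob))
g.add((EX.bob, EX.reportsTo, EX.carol))
g.add((EX.alice, EX.name, Literal("Alice")))

# SPARQL property path: everyone Alice reports to, directly or indirectly.
query = """
PREFIX ex: <http://example.org/>
SELECT ?boss WHERE { ex:alice ex:reportsTo+ ?boss }
"""
for row in g.query(query):
    print(row.boss)   # http://example.org/bob, then http://example.org/carol
```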

Graph Databases

Defined: A database that uses graph structures to store data. See the RDF triple stores / databases above; graph databases are a variant on this theme.

Advantages: Used primarily to store information on networks. Optimized for iterative joins, often in a recursive process (2) (a traversal sketch follows the footnotes below).

Disadvantages: Storage challenges – these are large datasets; results are built through iterative joins, which is very processor intensive.

•ArangoDB

•OrientDB

•Cayley

•Aurelius Titan

•Aurelius Faunus

•Stardog

•Neo4J

•AllegroGraph

(1) SKOS = Simple Knowledge Organization System. Relationships can be expressed as triples; examples are “is part of” and “is similar to.”

(2) Recursion versus iteration
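To illustrate why graph stores suit network questions that would otherwise take repeated self-joins in an RDBMS, here is a minimal pure-Python sketch (the network data is made up): a breadth-first traversal finds everything reachable from a starting node, each hop being the equivalent of one more join.

```python
from collections import deque

# Toy network stored as an adjacency list (node -> neighbours).
edges = {
    "acct_1": ["acct_2", "acct_3"],
    "acct_2": ["acct_4"],
    "acct_3": [],
    "acct_4": ["acct_1"],   # a cycle, which recursive SQL handles awkwardly
}

def reachable(start):
    """Breadth-first traversal; each level deeper is one more 'join' in relational terms."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}

print(reachable("acct_1"))   # {'acct_2', 'acct_3', 'acct_4'}
```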

NoSQL

File based storage – HDFS

Defined: Data structured to expose insights through the use of key value pairs. This has many of the characteristics of the XML, columnar and graph approaches. In this instance, the data is loaded and key value pair (KVP) files are created external to the data – think of the KVP as an index with a pointer back to the source data. This approach is generally associated with Hadoop / MapReduce capabilities, and the definition here assumes that KVP files are queried using the capabilities available in the Hadoop ecosystem (a toy sketch follows the footnotes below).

Advantages: Flexibility; MPP capabilities; speed; schema-less; scalable; great at creating views of data and performing simple calculations across Big Data; significant open source community – especially through the Apache Software Foundation. A shared nothing architecture optimizes the read process; however, it creates challenges in meeting ACID (1) requirements. File based storage systems adhere to the BASE (2) requirements instead.

Disadvantages: A shared nothing architecture creates complexity in uses where the sequencing of transactions or writes is important – especially when multiple nodes are involved; complex metadata requirements; few tool “packages” available to support production environments; relatively immature product set.

Document Store

•MongoDB

•CouchDB

Column Store

•Cassandra

•HBase

•Accumulo

Key Value Pair

•Redis

•Riak

(1) ACID = Atomicity; Consistency; Isolation; Durability. Used for transaction processing systems.

(2) BASE = Basically Available; Soft State; Eventual Consistency. Used for distributed parallel processing systems where maintaining complete consistency is often prohibitively expensive.
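A toy sketch of the key value pair model that Hadoop / MapReduce builds on (plain Python, single process, invented sample data): a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. Hadoop distributes these same steps across many nodes and files.

```python
from collections import defaultdict

records = ["error disk full", "ok", "error timeout", "error disk full"]

# Map: emit a (key, value) pair for every word in every record.
pairs = [(word, 1) for line in records for word in line.split()]

# Shuffle: group the values by key (Hadoop does this across the cluster).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce: aggregate each key's values.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # {'error': 3, 'disk': 2, 'full': 2, 'ok': 1, 'timeout': 1}
```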

In-Memory Approaches

Defined: Approaches where the data is loaded into active memory to improve efficiency. Note that multiple persistence approaches can be implemented in memory.

Advantages: Speed; flexibility – the ability to virtualize views and calculated / derived tables; think of datamarts in the traditional BI context (a small sketch follows the vendor list below).

Disadvantages: Hardware requirements and cost.

•SAP HANA

•SAS High Performance Analytics

•VoltDB
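A minimal sketch of the “virtualized view / derived table” idea using an in-memory SQLite database (a stand-in only – HANA, VoltDB and the rest have their own engines and APIs, and the table and figures here are invented): the base data and a derived aggregate both live entirely in memory, so no datamart is materialized on disk.

```python
import sqlite3

mem = sqlite3.connect(":memory:")          # everything below lives in RAM
mem.execute("CREATE TABLE sales (region TEXT, amount REAL)")
mem.executemany("INSERT INTO sales VALUES (?, ?)",
                [("East", 120.0), ("West", 75.0), ("East", 60.0)])

# A derived / calculated 'table' exposed as a view -- nothing is written to disk.
mem.execute("""
CREATE VIEW sales_by_region AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")

print(mem.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# [('East', 180.0), ('West', 75.0)]
```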

The classes of tools below are presented because they provide alternatives for capabilities that are likely to be required. Many of these capabilities are already resident in some of the tool sets discussed above.
Data Virtualization

Defined: The ability to produce tables or views without going through an ETL process.

Data virtualization is a capability built into other products. Any in-memory product inherently virtualizes data. Likewise, a number of the enterprise BI tools allow data – generally in the form of “cubes” – to be virtualized. Denodo Technologies is the major pure play vendor; the other vendors generally provide products that are part of larger suites of tools.

•Composite Software (Cisco)

•Denodo Technologies

•Informatica

•IBM

•MS

•SAP

•Oracle

Search Engines

Defined: Data management components that are used to search structured and unstructured data.

Search engines and appliances perform functions as simple as indexing data and as complex as Natural Language Processing (NLP) and entity extraction. They are referenced here because the functionality can be implemented as a stand alone capability and may be considered as part of the overall capability stack (a minimal indexing sketch follows the vendor list).

•Google Search Appliance

•Elasticsearch
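At their simplest, search engines are built around an inverted index. Here is a minimal plain-Python sketch (invented documents, and nothing like the NLP or entity extraction the real products layer on top): it maps each term to the documents containing it and answers a two-term AND query with a set intersection.

```python
from collections import defaultdict

docs = {
    1: "big data platforms and hadoop",
    2: "relational databases and sql",
    3: "hadoop and columnar databases",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query: documents containing both terms (an AND query is a set intersection).
hits = index["hadoop"] & index["databases"]
print(hits)   # {3}
```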

Hybrid Approaches

Defined: Data products that implement both SQL and NoSQL approaches.

These are traditional SQL database approaches that have been partnered with one or more of the approaches defined above. Teradata acquired Aster to create a “bolt on” to a traditional SQL database; IBM has DB2 / Netezza / BigInsights. SAS uses a file based storage system and has created “Access Modules” that work through Apache Hive to apply analytics within either an HDFS environment or the SAS environment.

Another hybrid approach is exemplified by Cassandra, which incorporates elements of a data model within an HDFS based system.

One also sees organizations implementing HDFS / RDBMS solutions for different functions: for example, acquiring, landing and staging data using an HDFS approach, and then, once the requirements and business use are known, creating structured data models to facilitate and control delivery.

Advantages: Integrated solutions; ability to leverage legacy; more developed toolkits to support production operations. Compared to open source, production ready solutions require less configuration and code development.

Disadvantages: Tend to be costly; architecture tends to be inflexible – all or nothing mindset.

•Teradata

•EMC

•SAS

•IBM

•Cassandra (Apache)


Primer on Big Data, Hadoop and “In-memory” Data Clouds

25 Aug

This is a good article. There have been a number of articles recently on the hype of big data, but the fact of the matter is that the technology related to what people are calling “big data” is here to stay, and it is going to change the way complex problems are handled. This article provides an overview. For those looking for products, this has a good set of links.

This is a good companion piece to the articles by Wayne Eckerson referenced in this post.

Slowly, Slowly the various pieces to build out the Hadoop vision are coming together

27 Jun

This article talks about Hunk (Splunk + Hadoop). It is a good example of how the various pieces are coming together to enable the Hadoop vision to become reality. Where will the mainstream commercial off the shelf folks be? Many of the mainstream data vendors are not moving their code into Hadoop (as Hunk does), but rather moving extracts of the data into their own machines.

There are some folks who believe that the COTS products will win in the end, as they are going to reduce the cost to operate – or the total cost of ownership – to the point that it does not matter how “free” the open source stuff is. On the other hand, there are companies that are going the other way and creating non-traditional support models around open source software. This started with Red Hat. However, Red Hat still has the same approach to licensing – no matter what you use in terms of support, you still pay the license fee – and that sounds like Oracle redux to me. I have seen that in a number of other Open Source product vendors as well. The new trend may be to support Open Source tools with a “pay for what you use” type menu. We will see how that goes. In the meantime, who names these products?

Analyst Desktop Binder – Interesting view of Social Media Exploitation

16 May

Interesting reading – especially if you have done work in the fusion centers

Much noise was made of the words that are searched within media. This is a pretty long list, and what it says to me is that there must be a significant amount of human intervention and, I would think, an awful lot of “noise.”

Hard to believe that this is that effective without knowing more about underlying capabilities, but my guess is that this is only a step above Googling those terms!

Gartner BI & Analytics Magic Quadrant is out…

10 Feb

The report can be obtained here, along with some industry analysis here.

Well the top right quadrant is becoming a crowded place.

Gartner BI Quadrant 2013

I have not had time to really go over this and compare it to last year’s, but the trends and challenges that we have been seeing are reflected in this report; some interesting points:

  1. All of the Enterprise level systems are reported to be hard to implement. This is no surprise – what always surprises me is that companies blame this on one company or another – they are all like that! It has to be one of the decision criteria when selecting one of the comprehensive tool sets.
  2. My sense is that IBM is coming along – and is in the running for the uber BI / Analytics company. However, the write up indicates that growth through acquisition is still happening. This has traditionally led to confusion in the product line and difficulty in implementation. This is especially the case when you implement in a big data or streaming environment.
  3. Tibco and Tableau continue to go head to head. I see Spotfire on top from a product perspective with its use of “R”, the purchase of Insightful and building on its traditional enterprise service bus business. HOWEVER, Gartner calls out the cost model as something that holds Spotfire back. This is especially true when compared to Tableau. My sense is that if TIBCO is selling an integrated solution, then they can embed the cost of the BI capabilities in the total purchase and this is how they are gaining traction. Regardless – Spotfire is a great product and TIBCO is set to do great things, but their price point sets them up against SAS and IBM, while their flagship component sets them up against Tableau at a lower price point. My own experience is that this knocks them out of the early stage activity, and hence they are often not “built in” to the later stage activity.
  4. SAS continues to dominate where analytics and Big Data are involved. However, it is interesting to note that Gartner calls out that they are having a hard time communicating business benefit. This is critical when you are selling an enterprise product at a premium price. Unlike IBM, which can draw on components that span the enterprise, SAS has to build the enterprise value proposition on the analytics stack only – this is not a problem unique to SAS – building the value proposition for enterprise level analytics is tough.
  5. Tableau is the darling of the crowd and moves into the Gartner Leaders’ Quadrant for the first time. The company has come out with a number of “Big Data” type features. They have connectors to Hadoop, and the article refers to in-memory and columnar databases. While these things are important, and the lack of them was holding the company back from entering certain markets, it is a bit at odds with their largest customer segment and their traditional positioning approach. Addressing larger, more integrated deployments takes them more directly into the competitive sphere of the big guys (SAP, IBM and SAS), and also into the sphere of TIBCO Spotfire.
  6. It would be interesting to run the Gartner analysis along different use cases (Fraud, Risk Management, Consumer Market Analysis, etc.). In certain circles one hears much of companies like Palantir, which has a sexy interface and might do well against Spotfire and Tableau, but it is not included here. Detica is another company that may do well. SAS would probably come out on top in certain areas, especially with the new Visual Analytics component. There are probably other companies that have comprehensive BI solutions for particular markets – if anyone has information on these types of solutions, I would be interested in a comment.

More to follow – and there is probably much more to say as things continue to evolve at a pace!

Open Source versus COTS

26 Jan

Public Sector Big Data: 5 Ways Big Data Must Evolve in 2013

Much of this article rings true. However, the last section requires some explanation:

“One could argue that as open source goes in 2013, Big Data goes as well. If open source platforms and tools continue to address agency demands for security, scalability, and flexibility, benefits from Big Data within and across agencies will increase exponentially. There are hundreds of thousands of viable open source technologies on the market today. Not all are suitable for agency requirements, but as agencies update and expand their uses of data, these tools offer limitless opportunities to innovate. Additionally, opting for open source instead of proprietary vendor solutions prevents an agency from being locked into a single vendor’s tool that it may at some point outgrow or find ill-suited for their needs.”

I take exception to this in that the decision to go open source versus COTS is really not that simple. It really depends on a number of things: the nature of your business; the resources you have available to you; and the enterprise platforms and legacy in place to name a few. If you implement a COTS tool improperly you can be locked into using that tool – just the same as if you implement an Open Source tool improperly.

How locked in you are to any tool is largely a question of how the solution is architected! Be smart and take your time ensuring that the logical architecture provides the right level of abstraction, and therefore a level of modularity – and thus flexibility. This article talks about agile BI architectures – we need to be thinking the same way about system architectures.

My feeling is that we are headed to a world where COTS products work in conjunction with Open Source – currently there are many examples of COTS products that ship with Open Source components – how many products ship with a Lucene indexer for example?
