Archive | Data Management

Gartner Magic Quadrant for Operational Database Management Systems is out

11 Nov

http://www.gartner.com/technology/reprints.do?id=1-1M9YEHW&ct=131028&st=sb

I had a conversation with someone the other day, and we agreed that there was no “front end” for Hadoop / NoSQL type data environments. This seems to be a big obstacle to these systems taking front and center from an operational perspective. More to follow on this.

Analytics keeps moving closer to the data!

18 Oct

http://feedproxy.google.com/~r/dbms2/feed/~3/QOuK0EQFRzs/

Note the list of partners – all have a background in visualization and analyst-driven capabilities, not big data munging. Where does this leave the companies that are neither visualization nor database companies? Companies like SAS.

Databases & Analytics – what database approach works best?

1 Aug

Every once in a while the question comes up as to what is the “right” database for analytics. How do organizations move from their current data environments to environments that are able to support the needs of Big Data and Analytics? It was not too long ago that the predominant answer was a relational database; moreover these were often organized around a highly normalized structure that arranged the fields and tables of a relational database to minimize redundancy and dependency (See also).

These structures to a large extent existed to optimize database efficiencies – or sidestep inefficiencies – in a world that was memory and / or hardware constrained; think 20+ years ago. Many of these constraints no longer exist, which has created more choices for practitioners in how to store data. This is especially true of data repositories that are built to support analytics, as a highly normalized structure is often inefficient and cumbersome for analytics. Matching the data design and management approaches to the need improves performance and reduces operational complexity, and with it costs.
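As a rough illustration of why a highly normalized layout can be cumbersome for analysts, here is a minimal sketch using Python's built-in sqlite3 module. The customers, orders, and orders_flat tables and their columns are hypothetical, invented only for this example; they do not come from any system discussed in these posts. The same question (revenue by region) needs a join against the normalized tables but only a single-table scan against a denormalized copy built for analysis.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a normalized and a flattened layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Normalized design: region lives only in the customers table.
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(customer_id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'East'), (2, 'West');
        INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);

        -- Denormalized copy for analytics: region is repeated on every order.
        CREATE TABLE orders_flat AS
            SELECT o.order_id, o.amount, c.region
            FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
        """
    )
    return conn


def revenue_by_region_normalized(conn):
    """The analyst has to know the schema and join two tables."""
    rows = conn.execute(
        """
        SELECT c.region, SUM(o.amount)
        FROM orders o JOIN customers c ON c.customer_id = o.customer_id
        GROUP BY c.region
        """
    ).fetchall()
    return dict(rows)


def revenue_by_region_flat(conn):
    """The same question against the flattened table needs no join."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders_flat GROUP BY region"
    ).fetchall()
    return dict(rows)
```

Both functions return the same totals; the trade-off is redundancy in storage versus simpler queries for the analyst.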

The table below lays out some of the approaches and where they might apply. Note, these are not mutually exclusive: one can persist semantic data in a relational database for example. Also, this is not exhaustive by any means. The table below provides a starting point for those that are considering how their data environments should evolve as they seek to move from their legacy environment to one that supports new demands created by the need for analytics.

Data Design Approach: Analytical Activity

Relational: In this context “relational” refers to data stored in rows. Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. In the context of analytics, the challenges associated with this form of data persistence are discussed in other posts, and in a favorite: Exploiting Big Data Strategies for Integrating with Hadoop by Wayne Eckerson (published June 1, 2012).

Columnar: Columnar data stores might also be considered relational. However, their orientation is around the column versus the row. Columnar data designs lend themselves to analytical tasks involving large data sets where rapid search and retrieval in large data tables is a priority. See previous post on how columnar databases work. A columnar approach inherently creates vertical partitioning across the datasets stored this way. Columnar DBs allow for retrieval of only a subset of the columns, and some columnar DBs allow for processing data in a compressed form. All this minimizes I/O for large retrievals.

Semantic: A semantic organization of data lends itself to analytical tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies are required to organize entities and their relationships to one another: corporate hierarchies/networks or insider trading analysis, for example. This approach to organizing data is often represented in the context of the “semantic web”, whose organizing constructs are RDF and OWL.

File Based: File based approaches such as those used in Hadoop and SAS systems lend themselves to situations where data must be acquired and landed, but the required organizational structure or analytical “context” is not yet defined. Data can be landed with minimal processing and made available for analysis in relatively raw form. Under certain circumstances file based approaches can improve performance, as they are more easily used in MPP (Massively Parallel Processing or distributed computing) environments. Performance improvements will exist where data size is very large, the functions performed are “embarrassingly parallel”, and the work can run on platforms that are designed around a “shared nothing architecture”, which is the architecture supporting Hadoop. The links referenced within the Relational section above speak to why and when you use Hadoop: see recent posts, and Exploiting Big Data Strategies for Integrating with Hadoop.
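To make the row versus column distinction concrete, here is a small sketch in plain Python. It is a toy model, not how any particular columnar database is implemented, and the sample records and function names are assumptions made up for the illustration. It shows two ideas from the Columnar entry: reading only the column you need touches fewer values, and a run-length-compressed column can be aggregated without decompressing it.

```python
from itertools import groupby

# Hypothetical sample data: a row store keeps whole records together.
ROWS = [
    {"order_id": 1, "region": "East", "amount": 100.0},
    {"order_id": 2, "region": "East", "amount": 50.0},
    {"order_id": 3, "region": "West", "amount": 75.0},
    {"order_id": 4, "region": "West", "amount": 25.0},
]


def to_column_store(rows):
    """Pivot row records into one list per column (vertical partitioning)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, column):
    """Row layout: every field of every record is fetched to use one field."""
    values_read = sum(len(row) for row in rows)
    return sum(row[column] for row in rows), values_read


def sum_from_columns(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


def run_length_encode(values):
    """Compress consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def count_by_value(encoded):
    """Aggregate directly on the compressed form, without expanding it."""
    counts = {}
    for value, run in encoded:
        counts[value] = counts.get(value, 0) + run
    return counts
```

With the four sample records, summing amount from the row layout touches 12 values (4 records of 3 fields each), while the column layout touches 4. The gap grows with the number of columns in the table, which is where the I/O savings on wide analytical tables come from.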

There are a few interesting papers on the topic – somewhat dated but still useful:

Curt Monash’s site has a presentation worth looking at, titled: How to Select an Analytical Database. In general, Curt’s blog DBMS2 is well worth tracking.

This deck presented by Mark Madsen at the 2011 Strata Conference is both informative and amusing.

This is a bioinformatics deck that was interesting. It does not have a date on it. However, it has good information from a field that has driven developments in approaches to dealing with large complex data problems.

Columnar Databases Explained

18 Jul

June Tong has done a great job of explaining columnar databases in this narrated slide presentation. This complements nicely the benchmarking work referenced in an earlier post on efficiencies using columnar databases.

Information Architecture – A Moving Target?

6 Jul

I am increasingly seeing articles that talk about the confusion in identifying and building out the right information architecture for the organization. The article here, with a clip below, speaks to that point. This is a good thing. People seek simplicity, and are looking for the prescriptive approach: 1) build a data warehouse; 2) build some datamarts for the business folks; 3) get a BI tool and build reports. But this does not cut it, as it is too rigid a structure for analysts or other stakeholders who have to do more than pull reports. The industry has responded – I am speaking in buzzwords here – by adding “sandboxes”; by adding ODS (Operational Data Stores); and by adding a whole new way of landing, staging, and persisting data and using it in analytical tasks (Hadoop). Sitting on top of this data level of the information architecture has been an explosion of tools that cater to (more buzzwords) data visualization, self-serve BI, and data mashups, to name a few.

Bottom line – how does this all get put together without creating an even bigger data mess than when you started? It is hard. What one sees so often is organizations putting off addressing the issue until they have a real problem. At this point, one sees a lot of sub-optimal management behavior. A consistent theme in the press is agility – organizations and their leaders need to embrace the agile manifesto. I am wholeheartedly behind this. HOWEVER, agility needs to be framed within a plan, a vision, or at least some articulated statement of an end point.

The article below is interesting as it presents agility as a key “must have” management approach, and yet it also discusses the fact that in order for an agile approach to be successful, it needs to adopt disciplines that are decidedly un-agile! This creates a dual personality for leaders within the data management related functions of an organization (BI, analytics, ERP, …). On the one hand, one wants to unleash the power of the tools and the creative intellect that is resident within the organization; on the other, there exists a desire to control, to reduce the noise around data, to simplify one’s life. The answer is to embrace both – build a framework that provides long term guidance, and iteratively deliver capabilities within that framework towards a goal that is defined in terms of business capabilities – NOT technology or tightly defined tactical goals.

The framework – whichever approach one chooses – will articulate the information architecture of the organization: how data flows around the organization to feed core business activities and advance management’s goals! It is important – if it cannot be explained on a one page graphic, it is probably too complicated!

Martin’s approach to tying things together is below…

“So given that there is not a one size fits all approach anymore – how does a company ensure its Information Architecture is developed and deployed correctly? Well, you have to build it from the ground up, and you have to keep updating it as the business requirements and implemented systems change. However, to do this effectively, the organisation must be cognisant of separating related workloads and host data on relevant and appropriate platforms, which are then tied together by certain elements, including:

See also:

  1. Polyglot persistence
  2. Data Management Maturity Model as an example of a way to start thinking about governance
  3. Agile development – a good idea so often badly implemented!

Slowly, slowly the various pieces to build out the Hadoop vision are coming together

27 Jun

This article talks about Hunk (Splunk + Hadoop). It is a good example of how the various pieces are coming together to enable the Hadoop vision to become reality. Where will the mainstream commercial off-the-shelf (COTS) folks be? Many of the mainstream data vendors are not moving their code into Hadoop (as Hunk is), but rather moving the extract into their machines.

There are some folks who believe that the COTS products will win in the end, as they are going to reduce the cost to operate – or the total cost of ownership – to the point that it does not matter how “free” the open source stuff is. On the other hand, there are companies that are going the other way, and creating non-traditional support models around open source software. This started with Red Hat. However, Red Hat still has the same approach to licensing – no matter what you use in terms of support, you still pay the license fee – that sounds like Oracle redux to me. I have seen that in a number of other open source product vendors as well. The new trend may be to support open source tools with a “pay for what you use” type menu. We will see how that goes. In the meantime, who names these products?

Seven Questions to Ask Your Data Geeks

13 Jun

This is a good article on some of the basics that are so often not asked.

I get disagreement, but for the most part big mistakes in analytics projects stem from these questions not being answered – rarely do they stem from the statistician getting the math wrong.

Efficiencies with Columnar Databases

10 Jun

A little while ago, I posted an entry from Andrew’s Crabtree Analytics blog where he had posted data on a test he did showing significant improvements in performance related to the use of columnar data structures when doing large joins. He has recently updated the test with significantly larger data volumes, and the results are posted here – still a big improvement.

Aggregate Persistence & Polyglot Persistence!

9 Jun

Gotta love the consultant speak!!

This short article provides an interesting perspective on how NoSQL differs from a data storage perspective, and why that is important. The article also points out that storing data on large clusters is very efficient from a storage perspective, but NOT if the data is relational in nature. In order to look at data across clusters efficiently, one needs to reorganize the data – this is where MapReduce comes in. MapReduce is great at reorganizing data to feed particular tasks – from my perspective a critical need for the analytical communities.
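The “reorganizing” role of MapReduce can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the partitions, field names, and function names are hypothetical and exist only for this example. Records are spread across partitions (standing in for cluster nodes) in whatever order they landed, and the job regroups them by the key the analysis needs.

```python
from collections import defaultdict

# Hypothetical data: each inner list stands in for records held on one node.
PARTITIONS = [
    [{"region": "East", "amount": 100}, {"region": "West", "amount": 75}],
    [{"region": "East", "amount": 50}],
]


def map_phase(partition, key_field, value_field):
    """Map: emit a (key, value) pair for every record in one partition."""
    for record in partition:
        yield record[key_field], record[value_field]


def shuffle(mapped_pairs):
    """Shuffle: bring all values that share a key together."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Reduce: collapse each key's list of values to a single result."""
    return {key: reducer(values) for key, values in grouped.items()}


def map_reduce(partitions, key_field, value_field, reducer=sum):
    """Run the three steps in sequence over all partitions."""
    mapped = (
        pair
        for partition in partitions
        for pair in map_phase(partition, key_field, value_field)
    )
    return reduce_phase(shuffle(mapped), reducer)
```

Calling map_reduce(PARTITIONS, "region", "amount") regroups the scattered records by region and sums the amounts. On a real cluster the map step runs where the data sits and the shuffle moves data between nodes, which is the expensive part this sketch glosses over.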

This links to the idea of “Polyglot Persistence”, which accepts that data will be stored in multiple mediums as new ways of persisting data evolve. I find this interesting as this mirrors what we are seeing today. Customers have Operational Data Stores – usually relational – and yet seek to perform tasks that are complicated by: 1) the size of the data, and 2) the constraints placed on how the data can be evaluated or analyzed by the data model or architecture. This motivates an exploration of new approaches; hence the discussions the industry is having on NoSQL (or to use the buzzwords: Hadoop; MapReduce; Big Data).

I may have simplified this a bit – apologies. At the end of the day, we are seeing a sea change in how organizations deal with data to more effectively apply it to the diverse needs demanded by the business side of the house. Explaining how organizations must change, but do so in a controlled, risk-reduced manner, is the challenge.

The story is what counts!! Data Scientists Draw Pictures and Tell Short Stories – Data Science Central

8 Apr

http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A60854