Data Stores | analyticaltern


A good baseline presentation on creating semantic data repositories

I have been meaning to give a shout out to Dan McCreary (http://www.danmccreary.com/) for some time now, and am only just getting around to it. What caught my attention was a presentation Dan gave at the 2006 Semantic Technology Conference. The presentation walked the audience through the process of creating a well-structured data repository enabling players within the Minnesota education system to ask complex questions across all of the data available. While the presentation is rather dated, it is one of the better presentations I have seen that takes things back to a level most people can understand. Additionally, I don’t see that the state of the industry has changed that much in terms of tools to support the creation of well-structured data repositories that support broad “knowledge management” goals.

Dan focuses on metadata as the means to achieve the goals of the participants: the Minnesota Department of Education, the Wisconsin Department of Public Instruction, and the Michigan Department of Education. The goal is to create a semantic understanding of the data available. Semantic is one of those words that means many things to different people. Once you combine “semantic” with “knowledge management” or “knowledge discovery”, everyone seems to have a different opinion as to what you are talking about – IT and business folks alike. It becomes very important to level set your audience. For the purposes of this discussion, creating a semantic repository involves:

Enhancing data with metadata that describes the data asset as a whole, and with respect to its component content parts;
Creating descriptive and content metadata that describes the data in specific terms, and in more general “conceptual” terms;
Creating a description of the data that reveals connections to other concepts, related terms, and multiple standards;
Creating highly linked data that is organized through taxonomies and ontologies to reveal links that are perhaps not obvious to users.

The desired end state is the ability for a broad range of end users to query disparate data sources and understand how available data can be used in their particular context.

Within the referenced deck (http://www.danmccreary.com/presentations/semweb2006/), there are a number of sections worth drawing your attention to:

ISO 11179 (Starting on page 15). Most people I speak to who know about ISO 11179 (http://www.iso.org/iso/home/search.htm?qt=11179&published=on&active_tab=standards&sort_by=rel) are somewhat skeptical, as the spec is a heavy one as written. Once you get your head around all the good ideas that have gone into the specification, the question becomes how much of it you adopt. The idea behind the spec is that each piece of metadata has its own metadata that allows a user to know exactly how that metadata is to be used. Dan presents the 11179 concept within his discussion of the National Information Exchange Model (NIEM https://www.niem.gov/Pages/default.aspx), which brings us to another interesting point…

Mapping to Standards (Page 16). I like the way that the process of mapping to standards is presented. In this instance, the Educational System is only trying to map to one standard. However, you can imagine an environment where there are multiple standards of interest. Here the mapping approach is organized around the OWL “SameAs” terminology (Web Ontology Language http://www.w3.org/TR/owl-features/). It does not say it in the presentation, but I am assuming that this is taken from the SKOS (Simple Knowledge Organization System) standards (http://www.w3.org/2004/02/skos/). As you classify data assets, there will be multiple mapping exercises where the mapping will not be exact. The linking concepts defined within SKOS will allow you to relate items whose relationships are not precise or consistent.
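To make the exact versus inexact mapping idea concrete, here is a minimal sketch in plain Python. The relation names (owl:sameAs, skos:closeMatch, skos:broadMatch) are real OWL and SKOS properties, but the local field names, the "std:" terms, and the translate helper are all invented for illustration and are not taken from Dan's deck or from any actual standard.

```python
from dataclasses import dataclass

# Relation names come from OWL and SKOS; everything else here is hypothetical.
EXACT = "owl:sameAs"
CLOSE = "skos:closeMatch"
BROADER = "skos:broadMatch"


@dataclass(frozen=True)
class Mapping:
    local_term: str      # field name in a local data source (invented)
    relation: str        # how strongly the two terms correspond
    standard_term: str   # term in the shared standard (invented "std:" prefix)


MAPPINGS = [
    Mapping("student_last_name", EXACT, "std:PersonSurName"),
    Mapping("grade_band", CLOSE, "std:GradeLevel"),
    Mapping("reading_score", BROADER, "std:AssessmentResult"),
]


def translate(local_term, allowed=(EXACT,)):
    """Return the standard terms for a local term, limited to the allowed relations."""
    return [
        m.standard_term
        for m in MAPPINGS
        if m.local_term == local_term and m.relation in allowed
    ]


# Strict lookups only follow exact matches; looser lookups opt in explicitly.
print(translate("student_last_name"))
print(translate("grade_band"))
print(translate("grade_band", allowed=(EXACT, CLOSE)))
```

Keeping the relation type on every mapping means a user can decide per query how much imprecision is acceptable, which is the point of having linking concepts beyond a simple "same as".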

Demonstrating that the data problem grows faster than the number of data sources (Page 17). This sounds so simple. However, I am still surprised when I hear people say, “we have the data mapped between the sources I want, so no need to spend time mapping to standards”. Well, OK, that works if you have a small number of static data sets, but that rarely happens. In the presentation Dan refers to this as the O(N²) problem versus the O(N) problem. I have included his graphics as these tell the story better than words.
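The arithmetic behind the O(N²) versus O(N) point can be sketched in a few lines of Python. This is my own illustration, assuming one mapping per pair of sources (count each direction separately and the pairwise number doubles); the function names are made up.

```python
def point_to_point_mappings(n_sources: int) -> int:
    """Mappings needed when every pair of sources is mapped directly: N(N-1)/2."""
    return n_sources * (n_sources - 1) // 2


def standard_mappings(n_sources: int) -> int:
    """Mappings needed when each source is mapped once to a shared standard: N."""
    return n_sources


for n in (3, 10, 50):
    print(n, point_to_point_mappings(n), standard_mappings(n))
```

With 10 sources that is 45 pairwise mappings against 10 mappings to a standard, and every additional source widens the gap.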

Tags: information Sharing, ISO 11179, metadata, metamodel, products, semantic

Categories Data Management, Data Stores, Master Data, Metadata, methodologies
Author analyticaltern

Gartner Magic Quadrant for Operational Database Management Systems is out

11 Nov

http://www.gartner.com/technology/reprints.do?id=1-1M9YEHW&ct=131028&st=sb

I had a conversation with someone the other day, and we agreed that there was no “front end” for Hadoop / NoSQL type data environments. This seems to be a big issue in terms of these systems taking front and center from an operational perspective. More to follow on this.

Categories Data Management, Data Stores, RDBMS
Author analyticaltern

Analytics keeps moving closer to the data!

18 Oct

http://feedproxy.google.com/~r/dbms2/feed/~3/QOuK0EQFRzs/

Note the list of partners – all have a background in visualization and analyst-driven capabilities – not big data munging. Where does this leave the companies that are neither visualization nor database companies? Companies like SAS.

Tags: industry, sas teradata aster, trends

Categories Big Data, Data Stores, Industry, Products
Author analyticaltern

Databases & Analytics – what database approach works best?

1 Aug

Every once in a while the question comes up as to what is the “right” database for analytics. How do organizations move from their current data environments to environments that are able to support the needs of Big Data and Analytics? It was not too long ago that the predominant answer was a relational database; moreover, these were often organized around a highly normalized structure that arranged the fields and tables of a relational database to minimize redundancy and dependency (See also).

These structures to a large extent existed to optimize database efficiencies – or sidestep inefficiencies – in a world that was memory and / or hardware constrained; think 20+ years ago. Many of these constraints no longer exist, which has created more choices for practitioners in how to store data. This is especially true of data repositories that are built to support analytics, as a highly normalized structure is often inefficient and cumbersome for analytics. Matching the data design and management approaches to the need improves performance and reduces operational complexity, and with it costs.

The table below lays out some of the approaches and where they might apply. Note, these are not mutually exclusive: one can persist semantic data in a relational database for example. Also, this is not exhaustive by any means. The table below provides a starting point for those that are considering how their data environments should evolve as they seek to move from their legacy environment to one that supports new demands created by the need for analytics.

Data Design Approach	Analytical Activity
Relational. In this context “relational” refers to data stored in rows.	Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. In the context of analytics, the challenges associated with this form of data persistence are discussed in other posts, and in a favorite of mine, Exploiting Big Data Strategies for Integrating with Hadoop by Wayne Eckerson (published June 1, 2012).
Columnar. Columnar data stores might also be considered relational. However, their orientation is around the column versus the row.	Columnar data designs lend themselves to analytical tasks involving large data sets where rapid search and retrieval in large data tables is a priority. See previous post on how columnar databases work. A columnar approach inherently creates vertical partitioning across the datasets stored this way. Columnar DBs allow for retrieval of only a subset of the columns, and some columnar DBs allow for processing data in a compressed form. All this minimizes I/O for large retrievals.
Semantic	A semantic organization of data lends itself to analytical tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies are required to organize entities and their relationships to one another: corporate hierarchies/networks; insider trading analysis for example. This approach to organizing data is often represented in the context of the “semantic web” whose organizing constructs are RDF and OWL.
File Based	File based approaches such as those used in Hadoop and SAS systems lend themselves to situations where data must be acquired and landed, but the required organizational structure or analytical “context” is not yet defined. Data can be landed with minimal processing and made available for analysis in relatively raw form. Under certain circumstances file based approaches can improve performance as they are more easily used in MPP (Massively Parallel Processing or distributed computing) environments. Performance improvements will exist where data size is very large, the functions performed are “embarrassingly parallel”, and the work can run on platforms that are designed around a “shared nothing architecture”, which is the architecture supporting Hadoop. The links referenced within the Relational section above speak to why and when you use Hadoop: see recent posts, and Exploiting Big Data Strategies for Integrating with Hadoop.
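The Columnar row above says that retrieving only a subset of the columns minimizes I/O. Here is a toy Python sketch of why, using invented records and simply counting how many stored values each layout has to touch to total one field. Real columnar engines add compression and much more, so treat this as an illustration of the layout only.

```python
# The same three records (values invented), first stored row by row.
rows = [
    {"id": 1, "region": "north", "sales": 120.0},
    {"id": 2, "region": "south", "sales": 80.0},
    {"id": 3, "region": "north", "sales": 45.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def values_read_row_store(records):
    """A row layout fetches whole records, even if one field is wanted."""
    return sum(len(record) for record in records)


def values_read_column_store(cols, wanted):
    """A column layout fetches only the columns the query names."""
    return sum(len(cols[name]) for name in wanted)


print(sum(columns["sales"]))                        # total sales from one column
print(values_read_row_store(rows))                  # values touched, row layout
print(values_read_column_store(columns, ["sales"])) # values touched, column layout
```

On three fields the difference is small, but on a wide table with hundreds of columns the row layout reads hundreds of values per record to answer a question about one of them.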

There are a few interesting papers on the topic – somewhat dated but still useful:

Curt Monash’s site has a presentation worth looking at titled How to Select an Analytical Database. In general, Curt’s blog DBMS2 is well worth tracking.

This deck, presented by Mark Madsen at the 2011 Strata Conference, is both informative and amusing.

Determine the Right Analytic Database: A Survey of New Data Technologies from mark madsen

This is a bioinformatics deck that was interesting. It does not have a date on it. However, it has good information from a field that has driven developments in approaches to dealing with large, complex data problems.

Tags: Big Data, Data Persistence, database

Categories Best Practices, Big Data, Data Management, Data Stores
Author analyticaltern

Columnar Databases Explained

18 Jul

June Tong has done a great job of explaining columnar databases in this narrated slide presentation. This complements nicely the benchmarking work referenced in an earlier post on efficiencies using columnar databases.

Demystifying Columnar Databases from June Tong

Tags: Columnar

Categories Data Stores
Author analyticaltern

Efficiencies with Columnar Databases

10 Jun

A little while ago, I posted an entry from Andrew’s Crabtree Analytics blog where he had posted data on a test he did showing significant improvements on performance related to the use of columnar data structures when doing large joins. He has recently updated the test with significantly larger data volumes, and the results are posted here – still a big improvement.

Tags: Columnar, Columnstoreindex, products, SQL Server, tes

Categories Big Data, Data Management, Data Stores, Products
Author analyticaltern

Aggregate Persistence & Polyglot Persistence!

9 Jun

Gotta love the consultant speak!!

This short article provides an interesting perspective on how NoSQL differs from a data storage perspective, and why that is important. The article also points out that storing data on large clusters is very efficient from a storage perspective, but NOT if the data is relational in nature. In order to look at data across clusters efficiently, one needs to reorganize the data – this is where MapReduce comes in. MapReduce is great at reorganizing data to feed particular tasks – from my perspective a critical need for the analytical communities.
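For readers who have not seen it, the reorganizing step can be sketched in plain Python. This is a single-process toy with invented records and function names, not Hadoop code: it only shows the map, shuffle and reduce shape, where data stored by date is regrouped by customer for an analytical task.

```python
from collections import defaultdict

# Records as they might be landed, organized by date (values invented).
records = [
    ("2013-06-01", "cust_a", 20.0),
    ("2013-06-01", "cust_b", 10.0),
    ("2013-06-02", "cust_a", 5.0),
]


def map_phase(record):
    """Emit (key, value) pairs keyed by what the analysis cares about."""
    _date, customer, amount = record
    yield customer, amount


def shuffle(pairs):
    """Group all values that share a key, as the framework would across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse each group to a single result."""
    return key, sum(values)


def run(all_records):
    pairs = (pair for record in all_records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(run(records))
```

Changing the key emitted in map_phase (for example to the date) reorganizes the same data for a different task without touching how it is stored.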

This links to a notion of “Polyglot Persistence”, which accepts that data will be stored in multiple mediums as new ways of persisting data evolve. I find this interesting as this mirrors what we are seeing today. Customers have Operational Data Stores – usually relational – and yet seek to perform tasks that are complicated by: 1) the size of the data, and 2) the constraints placed on how the data can be evaluated or analyzed by the data model or architecture. This motivates an exploration of new approaches; hence the discussions the industry is having on NoSQL (or to use the buzzwords: Hadoop; MapReduce; Big Data).

I may have simplified this a bit – apologies. At the end of the day, we are seeing a sea change in how organizations deal with data to more effectively apply it to the diverse needs demanded by the business side of the house. Explaining how organizations must change, but do so in a controlled, risk-reduced manner, is the challenge.

Another reason why Data Management and Analysts cannot lead separate lives

14 Feb


I found this article interesting in that it points out why the bridge between the data side of the house and the analytical side must be well established – if the data team implements a design that does not support analytics, it has material impacts. I know this is blindingly obvious, but ….

I have recently been in a number of discussions where the attitude was “we are going to build the data warehouse using best practices and years of experience, and it really does not matter what you are going to do with the data”. I know it is crazy, but… you know what I am talking about – we see it all the time.

The article itself tests performance on a columnar versus relational approach to persisting data, and has some surprising results – 4,100% improvement! I would be interested in other studies that have looked at the difference between different data architectures when performing analytical tasks.

Tags: Big Data, Columnar, Columnstoreindex, Infinidb, products, SQL Server, testing

Categories Big Data, Data Management, Data Stores, Products
Author analyticaltern
