
Gartner Magic Quadrant for Operational Database Management Systems is out

11 Nov

http://www.gartner.com/technology/reprints.do?id=1-1M9YEHW&ct=131028&st=sb

I had a conversation with someone the other day, and we agreed that there was no “front end” for Hadoop / NoSQL-type data environments. This seems to be a big issue in terms of these systems taking front and center from an operational perspective. More to follow on this.

Analytics keeps moving closer to the data!

18 Oct

http://feedproxy.google.com/~r/dbms2/feed/~3/QOuK0EQFRzs/

Note the list of partners – all have a background in visualization and analyst-driven capabilities, not big data munging. Where does this leave the companies that are neither visualization nor database companies? Companies like SAS.

Healthcare’s New Big Idea

14 Oct


Once upon a time in the American healthcare system, big data was an unknown idea. Recognizing that healthcare costs rose unmanageably and healthcare quality varied dramatically without clear explanation, Congress introduced Managed Care with the hope that relying upon a for-profit business model would make the system more competitive, more comprehensive, and more effective. Now, over thirty years later, it appears that new changes in American healthcare will position “big data” as the driver of effectiveness and competitiveness. Here are a few reasons why.

When thinking about the government policy that will make big data essential in the new healthcare system, three main pieces of legislation come to mind – the obvious heavyweight of the group being the Affordable Care Act (ACA). By now, most know that by passing the ACA into law, the federal government shifted America away from a volume-based system of care (in which doctors and hospitals make money based on how many tests they run and treatments they try) to a value-based system in which doctors and hospitals receive rewards according to the value created for patients. However, few know that in order to actualize this value-based system, the ACA directly implicates big data at the federal and state levels of healthcare. For example, the ACA authorizes the Department of Health and Human Services (HHS) to release decades of stored data and make it usable, searchable, and ultimately analyzable by the health sector as a whole to promote transparency in the markets for healthcare and health insurance. Here, the driver of transparency, and thus competitiveness and effectiveness, is clearly big data.

In other examples, the ACA uses language that endorses, if not mandates, big data use throughout the system. The ACA not only explicitly authorizes the Centers for Disease Control and Prevention (CDC) to “provide grants to states and large local health departments” to conduct “evidence-based” interventions, it creates a technical assistance program to diffuse “evidence-based therapies” throughout the healthcare environment. Note that in the medical community, “evidence-based medicine” means making treatment decisions for individual patients based on the best scientific evidence available, rendering the use of this relatively new term an endorsement of big data in healthcare treatment. These pieces of evidence – in the form of direct references to big data at the federal level, state level, and patient level – strongly support the conclusion that the ACA creates a new system reliant upon big data for efficiency and competitiveness.

The remaining pieces of legislation further signal big data as the new lifeblood of the American healthcare environment. In 2009, the Open Government Directive, in combination with the Health Data Initiative implemented by HHS, called for agencies like the Food and Drug Administration (FDA), the Centers for Medicare & Medicaid Services (CMS), and the CDC to liberate decades of data. The Health Information Technology for Economic and Clinical Health Act (HITECH), part of the 2009 American Recovery and Reinvestment Act, authorized over $39 billion in incentive payments for providers to use electronic medical records, with the goal of driving adoption up to 85% by 2019. Finally, to facilitate the exchange of information and accelerate the adoption of data reliance in the new health environment, CMS created the Office of Information Products and Data Analytics to oversee numerous databases and collaborate with the private sector. Among other functions, this office will oversee the more than $550 million spent by HHS to create data clearinghouses – run by states – that will consolidate data from providers within each state. All of this legislation, which essentially produces a giant slot for a big data peg to fill, paves the way for a new healthcare environment reliant upon rapid sharing, analysis, and synthesis of large quantities of community and national health data.

Now at this point, nearly four years after legislation supposedly opened the floodgates of big healthcare data to the private sector, the reader must wonder why more private sector companies haven’t taken advantage of an obvious market opportunity. The answer is: actually, a few first movers have.

Blue Cross / Blue Shield of California, working together with a company called Nant Health, has created an integrated technology system that allows hospitals, HMOs, and doctors to deliver evidence-based care to the patients under their jurisdiction. This system catalyzes performance improvement, and thus revenue-generating value creation, across the system. The use of big data has also allowed some first movers to innovate and build applications reliant upon newly liberated data. A company called Asthmapolis created a GPS-enabled tracker that monitors inhaler usage by asthmatics, directs the data into a central database used to detect macro-level trends, and merges the data with CDC data pertaining to known asthma triggers. These few cases illustrate that private sector engagement in this new market opportunity remains young, diverse, and far from fully mapped out.

The ACA has moved into its execution phase, and the introduction of the big data idea poses new and interesting challenges to how the American healthcare system will evolve. Some challenges will bring about positive change, such as the identification of clear opportunities for preventive care. Other challenges will bring negative change, such as the adverse effects transparency will likely have on certain patient groups. Regardless, it looks like big data is here to stay.

Primer on Big Data, Hadoop and “In-memory” Data Clouds

25 Aug

This is a good article. There have been a number of articles recently on the hype of big data, but the fact of the matter is that the technology related to what people are calling “big data” is here to stay, and it is going to change the way complex problems are handled. This article provides an overview. For those looking for products, this has a good set of links.

This is a good companion piece to the articles by Wayne Eckerson referenced in this post.

Databases & Analytics – what database approach works best?

1 Aug

Every once in a while the question comes up as to what is the “right” database for analytics. How do organizations move from their current data environments to environments that are able to support the needs of Big Data and analytics? It was not too long ago that the predominant answer was a relational database; moreover, these were often organized around a highly normalized structure that arranged the fields and tables to minimize redundancy and dependency (See also).

These structures to a large extent existed to optimize database efficiencies – or sidestep inefficiencies – in a world that was memory and / or hardware constrained; think 20+ years ago. Many of these constraints no longer exist, which has created more choices for practitioners in how to store data. This is especially true of data repositories built to support analytics, as a highly normalized structure is often inefficient and cumbersome for analytical work. Matching the data design and management approach to the need improves performance and reduces operational complexity, and with it costs.
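To make that trade-off concrete, here is a minimal sketch in Python using the standard library’s sqlite3 module. It is a hypothetical example – the tables, columns, and data are invented for illustration – contrasting a normalized design, where an analytical question requires a join, with a denormalized table built directly for that question.

```python
import sqlite3

# Hypothetical example: a normalized schema vs. a denormalized "analytics" table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: facts reference a dimension table to avoid redundancy.
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "EAST"), (2, "WEST")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 99.0), (11, 1, 25.0), (12, 2, 42.0)])

# An analytical question ("revenue by region") needs a join against the dimension.
normalized = cur.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customer c ON o.customer_id = c.customer_id
    GROUP BY c.region
""").fetchall()

# Denormalized design: the region is carried on every row, trading redundancy
# for simpler, join-free analytical queries.
cur.execute("CREATE TABLE orders_flat (order_id INTEGER, region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders_flat VALUES (?, ?, ?)",
                [(10, "EAST", 99.0), (11, "EAST", 25.0), (12, "WEST", 42.0)])
denormalized = cur.execute(
    "SELECT region, SUM(amount) FROM orders_flat GROUP BY region").fetchall()

print(normalized, denormalized)  # same answer, different storage trade-offs
```

The normalized form minimizes redundancy; the flat form accepts redundancy in exchange for simpler, faster analytical access – which is exactly the shift many analytics-oriented repositories make.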

The table below lays out some of the approaches and where they might apply. Note that these are not mutually exclusive: one can persist semantic data in a relational database, for example. Nor is this list exhaustive by any means. It is meant as a starting point for those considering how their data environments should evolve as they move from a legacy environment to one that supports the new demands created by analytics.

Data Design Approach – Analytical Activity

Relational. In this context “relational” refers to data stored in rows. Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. In the context of analytics, the challenges associated with this form of data persistence are discussed in other posts, and in a favorite, Exploiting Big Data Strategies for Integrating with Hadoop by Wayne Eckerson (published June 1, 2012).

Columnar. Columnar data stores might also be considered relational; however, their orientation is around the column rather than the row. Columnar data designs lend themselves to analytical tasks involving large data sets where rapid search and retrieval in large data tables is a priority. See the previous post on how columnar databases work. A columnar approach inherently creates vertical partitioning across the data sets stored this way. Columnar DBs allow for retrieval of only a subset of the columns, and some columnar DBs allow for processing data in a compressed form. All of this minimizes I/O for large retrievals. (A small sketch of the column-oriented layout follows this table.)

Semantic. A semantic organization of data lends itself to analytical tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies are required to organize entities and their relationships to one another: corporate hierarchies/networks or insider trading analysis, for example. This approach to organizing data is often represented in the context of the “semantic web”, whose organizing constructs are RDF and OWL. (A triple-based sketch also follows this table.)

File Based. File based approaches such as those used in Hadoop and SAS systems lend themselves to situations where data must be acquired and landed before the required organizational structure or analytical “context” is defined. Data can be landed with minimal processing and made available for analysis in relatively raw form. Under certain circumstances file based approaches can improve performance, as they are more easily used in MPP (Massively Parallel Processing, or distributed computing) environments. Performance improvements will exist where the data size is very large, the functions performed are “embarrassingly parallel”, and the work can run on platforms designed around a “shared nothing architecture” – which is the architecture supporting Hadoop. The links referenced within the Relational section above speak to why and when you use Hadoop: see recent posts, and Exploiting Big Data Strategies for Integrating with Hadoop. (A parallel-processing sketch closes out the examples below.)
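To illustrate the columnar point above, the sketch below is a hypothetical, product-agnostic example in plain Python: the same records laid out row-wise and column-wise, and a query that only needs two of the five columns. The data and column names are invented.

```python
# Hypothetical sketch: row-oriented vs. column-oriented layouts for the same data.
# It does not reflect any specific product; it only illustrates why columnar
# storage reduces I/O when a query touches a small subset of the columns.

# Row store: every record keeps all of its fields together.
row_store = [
    {"trade_id": 1, "symbol": "ABC", "qty": 100, "price": 10.5, "desk": "NY"},
    {"trade_id": 2, "symbol": "XYZ", "qty": 250, "price": 7.25, "desk": "LDN"},
    {"trade_id": 3, "symbol": "ABC", "qty": 75,  "price": 10.6, "desk": "NY"},
]

# Column store: each column is stored (and can be compressed) independently.
column_store = {
    "trade_id": [1, 2, 3],
    "symbol":   ["ABC", "XYZ", "ABC"],
    "qty":      [100, 250, 75],
    "price":    [10.5, 7.25, 10.6],
    "desk":     ["NY", "LDN", "NY"],
}

# "Total quantity by symbol" against the row store touches every field of every record.
totals_rows = {}
for rec in row_store:
    totals_rows[rec["symbol"]] = totals_rows.get(rec["symbol"], 0) + rec["qty"]

# The same question against the column store reads only two of the five columns.
totals_cols = {}
for symbol, qty in zip(column_store["symbol"], column_store["qty"]):
    totals_cols[symbol] = totals_cols.get(symbol, 0) + qty

assert totals_rows == totals_cols == {"ABC": 175, "XYZ": 250}
```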
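The semantic entry is easier to picture with a toy triple store. The sketch below is a hypothetical example – the entities and predicates are invented – using plain subject–predicate–object tuples (the same shape RDF uses) to walk a small corporate ownership hierarchy without committing to a fixed schema up front.

```python
# Hypothetical example: a tiny subject-predicate-object "triple" store, the same
# shape RDF uses, queried for a corporate-hierarchy style question.
triples = [
    ("AcmeHoldings",  "owns",       "AcmeBank"),
    ("AcmeBank",      "owns",       "AcmeBrokerage"),
    ("AcmeBrokerage", "employs",    "TraderA"),
    ("TraderA",       "tradedWith", "TraderB"),
    ("TraderB",       "worksFor",   "RivalFund"),
]

def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def ownership_chain(root):
    """Follow 'owns' relationships transitively from a root entity."""
    chain, frontier = [], [root]
    while frontier:
        entity = frontier.pop()
        for owned in objects(entity, "owns"):
            chain.append(owned)
            frontier.append(owned)
    return chain

# New relationship types can be added or traversed without reworking a schema,
# which is the appeal when the ontology itself is still evolving.
print(ownership_chain("AcmeHoldings"))   # ['AcmeBank', 'AcmeBrokerage']
print(objects("TraderA", "tradedWith"))  # ['TraderB']
```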
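Finally, the “embarrassingly parallel, shared nothing” idea behind file based processing can be sketched with nothing more than the Python standard library: each worker handles its own file independently, and only the small per-file results are merged. This is a toy stand-in for what an MPP or Hadoop-style environment does at scale; the file contents and the word-count function are hypothetical.

```python
import os
import tempfile
from collections import Counter
from multiprocessing import Pool

def count_words(path):
    """Process one file with no shared state - the 'shared nothing' part."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    # Hypothetical landed files, stood in for here by a few temp files.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i, text in enumerate(["big data big", "data lake", "big lake lake"]):
        path = os.path.join(tmpdir, f"landed_{i}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        paths.append(path)

    # Each file is independent, so the map step parallelizes trivially;
    # only the small per-file counts need to be merged (the reduce step).
    with Pool() as pool:
        partials = pool.map(count_words, paths)
    total = sum(partials, Counter())
    print(total)  # Counter({'big': 3, 'lake': 3, 'data': 2})
```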

There are a few interesting papers on the topic – somewhat dated but still useful:

Curt Monash’s site has a presentation worth looking at titled How to Select an Analytical Database. In general, Curt’s blog DBMS2 is well worth tracking.

This deck, presented by Mark Madsen at the 2011 Strata Conference, is both informative and amusing.

This is a bioinformatics deck that I found interesting. It does not have a date on it, but it offers good information from a field that has driven developments in approaches to dealing with large, complex data problems.

Information Architecture – A Moving Target?

6 Jul

I am increasingly seeing articles that talk about the confusion in identifying and building out the right information architecture for the organization. The article here, with a clip below, speaks to that point. This is a good thing. People seek simplicity, and are looking for the prescriptive approach: 1) build a data warehouse; 2) build some datamarts for the business folks; 3) get a BI tool and build reports. But this does not cut it, as it is too rigid a structure for analysts or other stakeholders that have to do more than pull reports. The industry has responded – I am speaking in buzzwords here – by adding “sandboxes”; by adding ODSs (Operational Data Stores); and by adding a whole new way of landing, staging, and persisting data and using it in analytical tasks (Hadoop). Sitting on top of this data level of the information architecture has been an explosion of tools that cater to (more buzzwords) data visualization, self-serve BI, and data mashups, to name a few.

Bottom line – how does this all get put together without creating an even bigger data mess than the one you started with? It is hard. What one sees so often is organizations putting off addressing the issue until they have a real problem. At that point, one sees a lot of sub-optimal management behavior. A consistent theme in the press is agility – organizations and their leaders need to embrace the agile manifesto. I am wholeheartedly behind this. HOWEVER, agility needs to be framed within a plan, a vision, or at least some articulated statement of an end point.

The article below is interesting in that it presents agility as a key “must have” management approach, and yet it also discusses the fact that in order for an agile approach to be successful, it needs to adopt disciplines that are decidedly un-agile! This creates a dual personality for leaders within the data management related functions of an organization (BI, analytics, ERP, …). On the one hand, one wants to unleash the power of the tools and the creative intellect that is resident within the organization; on the other, there exists a desire to control, to reduce the noise around data, to simplify one’s life. The answer is to embrace both – build a framework that provides long-term guidance, and iteratively deliver capabilities within that framework toward a goal that is defined in terms of business capabilities – NOT technology or tightly defined tactical goals.

The framework – whichever approach one chooses – will articulate the information architecture of the organization: how data flows around the organization to feed core business activities and advance management’s goals! This is important – if it cannot be explained on a one-page graphic, it is probably too complicated!

Martin’s approach to tying things together is below…

“So given that there is not a one size fits all approach anymore – how does a company ensure its Information Architecture is developed and deployed correctly? Well, you have to build it from the ground up, and you have to keep updating it as the business requirements and implemented systems change. However, to do this effectively, the organisation must be cognisant of separating related workloads and host data on relevant and appropriate platforms, which are then tied together by certain elements, including:

See also:

  1. Polyglot persistence
  2. Data Management Maturity Model as an example of a way to start thinking about governance
  3. Agile development – a good idea so often badly implemented!

Slowly, slowly, the various pieces to build out the Hadoop vision are coming together

27 Jun

This article talks about Hunk (Splunk + Hadoop). It is a good example of how the various pieces are coming together to enable the Hadoop vision to become reality. Where will the mainstream commercial off-the-shelf (COTS) folks be? Many of the mainstream data vendors are not moving their code into Hadoop (as Hunk does), but rather moving extracts into their own machines.

There are some folks who believe that the COTS products will win in the end, as they are going to reduce the cost to operate – or the total cost of ownership – to the point that it does not matter how “free” the open source stuff is. On the other hand, there are companies that are going the other way, creating non-traditional support models around open source software. This started with Red Hat. However, Red Hat still has the same approach to licensing – no matter what you use in terms of support, you still pay the license fee – and that sounds like Oracle redux to me. I have seen that in a number of other open source product vendors as well. The new trend may be to support open source tools with a “pay for what you use” type menu. We will see how that goes. In the meantime, who names these products?

Seven Questions to Ask Your Data Geeks

13 Jun

This is a good article on some of the basics that are so often not asked.

I get disagreement on this, but for the most part big mistakes in analytics projects stem from these questions not being answered – rarely do they stem from the statistician getting the math wrong.

Efficiencies with Columnar Databases

10 Jun

A little while ago, I posted an entry from Andrew’s Crabtree Analytics blog where he had posted data from a test showing significant performance improvements from the use of columnar data structures when doing large joins. He has recently updated the test with significantly larger data volumes, and the results are posted here – still a big improvement.

[Book] Delivering Business Analytics: Practical Guidelines for Best Practice – AnalyticBridge

10 Jun

http://www.analyticbridge.com/m/group/discussion?id=2004291%3ATopic%3A246264

Looking for feedback on this book.