
Forensic Analytics and the search for “robust” solutions

12 Jan

Happy New Year!

This entry has been sitting in my “to publish” file for some time. There is much more to be said on the topic; however, in the interest of getting it out … enjoy!

=======================================================

This entry was prompted by the INFORMS Analytics Magazine article titled Forensic Analytics: Adapting to a Growing Pandemic by Priti Ravi, a senior manager with Mu Sigma who specializes “in providing analytics-driven advisory services to some of the largest retail, pharmaceutical and technology clients spread across the United States.”

Ms. Ravi writes a good article that left me hanging. Her conclusion was that the industry lacks access to sophisticated and intelligent monitoring equipment, and that there is a need for “robust fraud management systems” that “offer a collective set of techniques” to implement a “complex adaptive approach.” I could not agree more. However, where are these systems? Or, for that matter, what are these systems?

Adaptive Approaches

To the last question first. What is a complex adaptive approach? If you Google the phrase, the initial entries involve biology and ecosystems. However, Wikipedia’s definition encompasses medicine, business and economics (among others) as areas of applicability. From an analytics perspective, I define complex adaptive challenges as those that are impacted by the execution of the analytics – by doing the analysis, the observed behaviors change. This is inherently true of fraud: the moment perpetrators understand (or believe) they can be detected, their behavior changes. However, it also applies to a host of other types of challenges: criminal activity, regulatory compliance enforcement and national security, as well as things like consumer marketing and financial investment.

In an article titled Images & Video: Really Big Data, the authors (Fritz Venter, the director of technology at AYATA, and Andrew Stein, the chief adviser at the Pervasive Strategy Group) define an approach they call “prescriptive analytics” that is ideally suited to adaptive challenges. They define prescriptive analytics as follows:

“Prescriptive analytics leverages the emergence of big data and computational and scientific advances in the fields of statistics, mathematics, operations research, business rules and machine learning. Prescriptive analytics is essentially this chain of transformations whereby structured and unstructured big data is processed through intermediate representations to create a set of prescriptions (suggested future actions). These actions are essentially changes (over a future time frame) to variables that influence metrics of interest to an enterprise, government or another institution.”

My less wordy definition: adaptive approaches deliver a broad set of analytical capabilities that enable a diverse set of integrated techniques to be applied recursively.

What Does the Robust Solution Look Like?

Defining adaptive analytics this way, one can identify characteristics of the ideal “robust” solution as follows:

  • A solution that builds out a framework that supports the broad array of techniques required.
  • A solution that is able to deal with the challenges of recursive processing. This is very data and systems intensive. Essentially, for every observation evaluated, the system must determine whether or not the observation changes any prior observation or assertion (see the sketch after this list).
  • A solution that engages users and subject matter experts to effectively integrate business rules. In an environment where traditional predictive analytic models have a short shelf life (see Note 1), engaging with the user community is often the mechanism to quickly capture environmental changes. For example, in the banking world, tracking call center activity will often identify changes in fraud behavior faster than a neural network set of models. Engaging the user in the analytical process requires user interfaces and data visualization approaches that are targeted at the user population and integrate with the organization’s work processes. Visualization engages non-technical users and helps them apply their experience and intuition to the data to expose insights. The Census Bureau has an interesting page, and if you look at Google Images, you can get an idea of visualization approaches.
  • A solution that provides native support for statistical and mathematical functions supporting activities associated with data mining: clustering, correlation, pattern discovery, outlier detection, etc.
  • A solution that structures unstructured data: categorize, cluster, summarize, tag/extract. Of particular importance here is the ability to structure text or other unstructured data into taxonomies or ontologies related to the domain in question.
  • A solution that persists data with the rich set of metadata required to support complex analytics. While it is clearer why unstructured data must be organized into a taxonomy / ontology, this also applies to structured data. Organizing data consistently across the variety of sources allows non-obvious relationships to be exposed and enables the application of more complex analytical approaches.
  • A solution that is relatively data-agnostic – data will come from many places and exist in many forms. The solution must manage this diversity and provide a flexible way to integrate new data into the analytical framework.
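To make the recursive-processing point above concrete, here is a minimal sketch in Python. The names (Observation, Assertion, the threshold rule) are invented for illustration and do not come from any specific product; the point is simply that each new observation can force a prior assertion to be re-evaluated.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    entity: str
    amount: float

@dataclass
class Assertion:
    entity: str
    label: str                               # e.g. "normal" or "suspicious"
    support: list = field(default_factory=list)

def score(entity_history):
    # Hypothetical rule: flag an entity once its running total gets large
    return "suspicious" if sum(o.amount for o in entity_history) > 10_000 else "normal"

def process(stream):
    history, assertions = {}, {}
    for obs in stream:
        history.setdefault(obs.entity, []).append(obs)
        # Every new observation can change a PRIOR assertion, so the affected
        # entity is re-scored rather than only classifying the new record.
        new_label = score(history[obs.entity])
        prior = assertions.get(obs.entity)
        if prior is None or prior.label != new_label:
            assertions[obs.entity] = Assertion(obs.entity, new_label, history[obs.entity][:])
    return assertions

if __name__ == "__main__":
    stream = [Observation("acct-1", 6_000), Observation("acct-2", 50), Observation("acct-1", 7_000)]
    for a in process(stream).values():
        print(a.entity, a.label)   # acct-1 flips to "suspicious" only after the second observation
```

Trivial as it is, the sketch shows why this is data and systems intensive: a production system has to do the equivalent of the re-score step across very large histories, continuously.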

What are Candidate Tools ?

And now to the second question: where are these tools? It is hard to find tools that claim to be “adaptive analytic” tools, or “prescriptive analytics” tools or systems, in the sense that I have described them above. I find it interesting that over the last five years, major vendors have subsumed complex analytical capabilities into more easily understandable components. Specifically, you used to be able to find Microsoft Analysis Services easily on their site. Now it is part of MS SQL Server as SSAS, much the same way that the reporting service is now part of the database offering as SSRS (Reporting Services). There was a time a few years ago when you had to look really hard on the MS site to find Analysis Services. Of course, since then Microsoft has integrated various BI acquisitions into the offering and squared away their marketing communication. Now their positioning is squarely around BI and the database. Both of these concepts are easier to sell at the executive level than the notion of prescriptive or adaptive analytics.

The emergence of databases and appliances optimized around analytics has simplified the message on the data side. Everyone knows they need a database, and now they have one for analytics. At the decision-maker level, that is a much easier decision than trying to figure out what kind of analytical approach the organization is going to adopt. Vendors like Teradata have always supported analytics through the integration of SAS, and now R, as in-database functionality. Greenplum, Netezza and others have also incorporated SAS and the open source analytical language “R”. In addition, we have seen the emergence (not new, but much more talked about, it seems) of the columnar database. The one I hear about most is the Sybase IQ product, although there have been a number of posts on the topic here, here, and here.

My point here is that vendors have too hard a time selling complex analytical solutions, and have subsumed the complex capabilities into concepts that are easier to package, position and communicate around; namely, database products and Business Intelligence products. The following are product sets that are candidates for the integrated approach. We start with the big players first and work towards those that are less obvious candidates.

SAS

The SAS Fraud Framework provides an integration of all the SAS components that are required to implement a comprehensive analytics solution around adaptive challenges (all kinds of fraud, compliance, money laundering, etc., as examples). This is a comprehensive suite of capabilities that spans all activities: data capture, ingest and quality; analytics tools (including algorithm libraries); data visualization; and reporting / BI capabilities. Keep in mind that SAS is a company that sells the building blocks, and the Fraud Framework is just that, a framework within which customers can build out capabilities. This is not a simple plug-and-play implementation process. It takes time, investment and the right team within the organization. The training has improved, and it is now possible to get comprehensive training.

As with any implementation of SAS, this one comes with all the caveats associated with comprehensive enterprise systems that integrate analytics into the fabric of an organization. The Gartner 2013 BI report indicates that SAS is “very difficult to implement”. This theme echoes across the product set. Having said that, when it comes to integrated analytics of the kind we have been discussing, all of the major vendors suffer from the same implementation challenges – although perhaps for different reasons.

The bottom line, however, is that SAS is a company grounded in analytics – the Fraud Framework has everything needed to build out a first-class system. However, the corporate culture builds products for hard-core quants, and this is reflected in the Gartner comments.

IBM

IBM is another company that has the complete offer. They have invested heavily in the analytics space, and between their ETL tools, the database / appliance and Big Data capabilities, the statistical product set that builds off SPSS, and the Cognos BI suite, users can build out the capabilities required. Although these products are being integrated into a seamless set of capabilities, they remain somewhat separate, and this probably explains some of the implementation challenges reported. Also, the product side of the IBM operation does not necessarily speak with the Global Services side of the house.

I had thought when IBM purchased Systems Research & Development (SRD) in 2005 that they were going to build out the capabilities that SRD and Jeff Jonas had developed. Jeff heads up the Entity Analytics group within IBM Research, and his blog is well worth the read. However, the above product set appears to have remained separate from the approaches and intellectual knowledge that came with SRD. This may be on purpose – from a marketing perspective, “buy the product set, and then buy IBM services to operationalize the system” is not a bad approach.

Regardless, the saying that “no one ever got fired for buying IBM” probably still holds true. However, as with SAS, beware of the implementation! Any one of the above products (SPSS, Cognos, and InfoSphere) requires attention when implementing. When integrating them as an operational whole, project leadership needs to ensure that expectations as to the complexity and time frame are communicated.

Other Products

There are many other product sets and I look forward to learning more about them. Once I post this, someone is going to come back and mention “R” and other open source products. There are plenty out there. However, be aware that while the products may be robust, many are not delivered as an integrated package.

With respect to open source tools, it is worth noting that the capabilities inherent in Hadoop – and the related products – lend themselves to adaptive analytics in the sense that operators can consistently re-link and re-index on the fly without having to deal with where and how the data is persisted. This is key in areas like signals intelligence, unstructured data analysis, and even structured data analysis where the notion of semantic equivalence is shifting. This is a juicy topic all by itself and worthy of a whole blog entry.

Notes:

  1. Predictive analytics relies on past observations to predict future observations. In an adaptive environment, the inputs to those predictive models continually change as a result of the outputs produced from the past observations.

Evaluating Different Persistence Methods as part of the Planning Process

4 Jul

Every once in a while, I get asked how to select between different types of databases. Generally, this question comes as a result of a product vendor or consultant recommending a move towards a Big Data solution. The issue is twofold: companies seek to understand what the next-generation data platform looks like, AND how, or whether, their current environment can evolve. This involves understanding the pros and cons of the current product set and to what degree it can coexist with newer approaches – Hadoop being the current platform people talk about.

The following is a list of data persistence approaches that helps at least define the options. This was done some time ago, so I am sure the vendors shown have evolved. However, think of it as a starting point to frame the discussion.

In general, one wants to anchor these discussions in some defined criteria that can help frame the discussion within the context of  business drivers. In the following figure, the goal is to show that as data sources and consumers of your data expand to include increasingly complex data structures and “contexts,” there is a need to evolve approaches beyond the traditional relational database (RDBMS) approaches. Different organizations will have different criteria. I provide this as a rubric that has worked before – you will need to create an approach that works for your organization or client.

Evolution of Data Persistence

A number of data persistence approaches support the functional components as defined. These are described below.

Relational Databases (Row Oriented)

Defined: Traditional normalized data models optimized for efficiently storing data. Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and / or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. This approach is challenged when dealing with complex semantic data where multiple levels of parent / child relationships exist.

Advantages: This approach is best for transactional data where the relationships between the data and the use cases driving how data is accessed and used are stable. In uses where relational integrity is important and must be enforced in a consistent manner, this approach can work well. In a row-based approach, contention on record locking is easier to manage than with other methods.

Disadvantages: As the relationships between data and relational integrity are enforced through the application of a rigid data model, this approach is inflexible, and changes can be hard to implement.

Vendor examples: all major database vendors – IBM DB2, Oracle, MS SQL Server and others.
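As a small, concrete illustration of the schema-first, row-oriented style described above, here is a sketch using Python’s built-in sqlite3 module. The tables and column names are invented for the example; the point is that the join works precisely because the relationship was designed up front.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# A small normalized schema: customers and their orders
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE orders (
                  id INTEGER PRIMARY KEY,
                  customer_id INTEGER NOT NULL REFERENCES customer(id),
                  amount REAL NOT NULL)""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 250.0), (2, 1, 75.0), (3, 2, 40.0)])

# The join is efficient because the relationship was modeled up front;
# ad hoc questions that cut across the schema are where the friction appears.
for name, total in conn.execute(
        "SELECT c.name, SUM(o.amount) FROM customer c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"):
    print(name, total)
```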
Columnar Databases (Column Oriented)

Defined: Data organized or indexed around columns; can be implemented in SQL or NoSQL environments.

Advantages: Columnar data designs lend themselves to analytical tasking involving large data sets where rapid search, retrieval and aggregation-type queries are performed on large data tables. A columnar approach inherently creates vertical partitioning across the datasets stored this way. It is efficient and scalable.

Disadvantages: efficiencies can be offset when queries touch many columns, which must be stitched back together (joined) to obtain the desired result.

Vendor examples:
•Sybase IQ
•InfoBright
•Vertica (HP)
•ParAccel
•MS SQL 2012
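A rough way to see why column orientation helps aggregation queries is the pure-Python sketch below. The timings are only indicative (a real columnar engine adds compression and vectorized execution on top of this layout), and the column names are invented for the example.

```python
import time

# Row-oriented: one record per row; an aggregate must touch every field of every row
rows = [{"region": i % 10, "sales": float(i), "units": i % 7} for i in range(1_000_000)]

# Column-oriented: one contiguous list per column; the aggregate reads only "sales"
columns = {name: [r[name] for r in rows] for name in ("region", "sales", "units")}

t0 = time.perf_counter()
row_total = sum(r["sales"] for r in rows)      # scans whole records
t1 = time.perf_counter()
col_total = sum(columns["sales"])              # scans a single column
t2 = time.perf_counter()

print(f"row scan {t1 - t0:.3f}s, column scan {t2 - t1:.3f}s, totals match: {row_total == col_total}")
```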

RDF Triple Stores / Databases

Defined: Data stored and organized around RDF triples (subject – predicate – object, sometimes described as actor – action – object); can be implemented in SQL or NoSQL environments.

Advantages: A semantic organization of data lends itself to analytical and knowledge management tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies or SKOS (1) type relationships are required to organize entities and their relationships to one another: corporate hierarchies / networks, or insider trading analysis, for example. This approach to organizing data is often represented in the context of the “semantic web,” whose organizing constructs are RDF and OWL. When dealing with complex semantic data where multiple levels of parent / child relationships exist, this approach is more efficient than an RDBMS.

Disadvantages: This approach to storing data is often not as efficient as relational approaches. It can be complicated to write queries to traverse complex networks – however, this is often not much easier in relational databases either.

Note: these can be implemented with XML formatting or in some other form. (A minimal triple-store sketch appears at the end of this section, after the footnotes.)

Vendor examples – Native XML / RDF databases:
•MarkLogic (COTS)
•OpenLink Virtuoso (COTS)
•Stardog (o/s, COTS)
•BaseX (o/s)
•eXist (o/s)
•Sedna (o/s)

XML-enabled databases (which handle XML as a CLOB in a table, or organized into tables based on a schema):
•IBM DB2
•MS SQL
•Oracle
•PostgreSQL

Graph Databases

Defined: A database that uses graph structures to store data. See the XML / RDF stores / databases above; graph databases are a variant on this theme.

Advantages: Used primarily to store information on networks. Optimized for iterative joins, often in a recursive process (2).

Disadvantages: Storage challenges – these are large datasets; the graph is built through iterative joins, which is very processor intensive.

Vendor examples:
•ArangoDB
•OrientDB
•Cayley
•Aurelius Titan
•Aurelius Faunus
•Stardog
•Neo4J
•AllegroGraph

(1) SKOS = Simple Knowledge Organization System. Relationships can be expressed as triples; examples are “is part of” and “is similar to”.

(2) Recursion versus iteration
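To give a feel for the triple / query style of working described above, here is a minimal sketch using the open source rdflib package for Python. The namespace, resource names and predicates are invented for the example; none of this is tied to a particular commercial triple store.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Assert a tiny corporate hierarchy as subject-predicate-object triples
g.add((EX.acme_subsidiary, EX.isPartOf, EX.acme_holdings))
g.add((EX.jane, EX.worksFor, EX.acme_subsidiary))
g.add((EX.jane, EX.isSimilarTo, EX.john))

# Traverse the relationships with SPARQL: who works for an organization
# that is part of a larger holding company?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?parent WHERE {
        ?person ex:worksFor ?org .
        ?org ex:isPartOf ?parent .
    }""")

for row in results:
    print(row.person, "ultimately works under", row.parent)
```

The appeal for adaptive problems is that a new relationship is just another triple added to the graph; no schema change is required before the query above can exploit it.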

NoSQL – File-Based Storage (HDFS)

Defined: Data structured to expose insights through the use of key-value pairs. This has many of the characteristics of the XML, columnar and graph approaches. In this instance, the data is loaded, and key-value pair (KVP) files are created external to the data. Think of the KVP as an index with a pointer back to the source data. This approach is generally associated with the Hadoop / MapReduce capabilities, and the definition here assumes that KVP files are queried using the capabilities available in the Hadoop ecosystem.

Advantages: Flexibility; MPP capabilities; speed; schema-less; scalable; great at creating views of data and performing simple calculations across Big Data; a significant open source community, especially through the Apache Foundation. The shared-nothing architecture optimizes the read process. However, it creates challenges in meeting ACID (1) requirements; file-based storage systems adhere to the BASE (2) requirements instead.

Disadvantages: The shared-nothing architecture creates complexity in uses where the sequencing of transactions or writing data is important – especially when multiple nodes are involved; complex metadata requirements; few tool “packages” available to support production environments; a relatively immature product set.

(A plain-Python sketch of the key-value pair idea follows after the footnotes below.)

Vendor examples:

Document Store
•MongoDB
•CouchDB

Column Store
•Cassandra
•HBase
•Accumulo

Key-Value Pair
•Redis
•Riak

(1) ACID = Atomicity, Consistency, Isolation, Durability. Used for transaction processing systems.

(2) BASE = Basically Available, Soft state, Eventual consistency. Used for distributed parallel processing systems where maintaining complete consistency is often prohibitively expensive.
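For readers who have not worked with the key-value pair pattern, the following plain-Python sketch mimics the map / shuffle / reduce flow described above. It uses no Hadoop API at all – the record format and grouping logic are invented for illustration; Hadoop performs the same grouping step across many nodes.

```python
from collections import defaultdict

records = [
    "2013-01-02,store-7,19.99",
    "2013-01-02,store-3,5.00",
    "2013-01-03,store-7,7.50",
]

# Map: emit a (key, value) pair per record; here key = store, value = sale amount
def mapper(line):
    _, store, amount = line.split(",")
    yield store, float(amount)

# Shuffle: group values by key (Hadoop does this across nodes; here it is a dict)
grouped = defaultdict(list)
for line in records:
    for key, value in mapper(line):
        grouped[key].append(value)

# Reduce: collapse each key's values into a result
totals = {store: sum(values) for store, values in grouped.items()}
print(totals)   # {'store-7': 27.49, 'store-3': 5.0}
```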

In-Memory Approaches

Defined: Approaches where the data is loaded into active memory to improve efficiency. Note that multiple persistence approaches can be implemented in memory.

Advantages: Speed; flexibility – the ability to virtualize views and calculated / derived tables; think of data marts in the traditional BI context.

Disadvantages: Hardware requirements and cost.

Vendor examples:
•SAP HANA
•SAS High Performance Analytics
•VoltDB

The classes of tools below are presented as they provide alternatives for capabilities that are likely to be required. Many of these capabilities are resident in some of the tool sets already discussed.

Data Virtualization

Defined: The ability to produce tables or views without going through an ETL process. Data virtualization is a capability built into other products. Any in-memory product inherently virtualizes data. Likewise, a number of the enterprise BI tools allow data – generally in the form of “cubes” – to be virtualized. Denodo Technologies is the major pure-play vendor; the other vendors generally provide products that are part of larger suites of tools.

Vendor examples:
•Composite Software (Cisco)
•Denodo Technologies
•Informatica
•IBM
•MS
•SAP
•Oracle

Search Engines

Defined: Data management components that are used to search structured and unstructured data. Search engines and appliances perform functions as simple as indexing data and as complex as Natural Language Processing (NLP) and entity extraction. They are referenced here because the functionality can be implemented as a stand-alone capability and may be considered part of the overall capability stack.

Vendor examples:
•Google Search Appliance
•Elasticsearch

Hybrid Approaches

Defined: Data products that implement both SQL and NoSQL approaches. These are traditional SQL database approaches that have been partnered with one or more of the approaches defined above. Teradata acquired Aster to create a “bolt-on” to a traditional SQL database; IBM has DB2 / Netezza / BigInsights. SAS uses a file-based storage system and has created “Access Modules” that work through Apache Hive to apply analytics within either an HDFS environment or the SAS environment.

Another hybrid approach is exemplified by Cassandra, which incorporates elements of a data model on top of a distributed, HDFS-style file-based storage layer.

One also sees organizations implementing HDFS / RDBMS solutions for different functions: for example, acquiring, landing and staging data using an HDFS approach, and then, once the requirements and the business use are known, creating structured data models to facilitate and control delivery.

Advantages: Integrated solutions; the ability to leverage legacy systems; more developed toolkits to support production operations. Compared to open source, production-ready solutions require less configuration and code development.

Disadvantages: Tend to be costly; the architecture tends to be inflexible – an all-or-nothing mindset.

Vendor examples:
•Teradata
•EMC
•SAS
•IBM
•Cassandra (Apache)

A comparison of programming languages in economics

8 Jul

An interesting comparison of programming language speeds. Given that the big data world seems to be all about Python, I wonder whether folks will move away from Python once they start doing complicated calculations over big data. SAS is apparently working on “Accelerators” to run on Hadoop nodes, which appear to address this same problem; they already have them for databases and DB appliances.

The above makes sense if you consider that, for the most part, “Big Data” is about doing simple calculations in parallel over many data nodes.
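That point – simple calculations pushed out in parallel over partitions of the data – can be illustrated with nothing more than Python’s standard library. This is a toy, single-machine stand-in for what a cluster does across many nodes, with the chunking scheme invented for the example.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # The per-"node" work is deliberately trivial: sum one partition of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(10_000_000))
    chunks = [data[i::4] for i in range(4)]          # four "nodes"
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))   # scatter, compute, gather
    print(total == sum(data))                        # True
```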

The thread of comments below the article is also interesting.

===================================

There is a new NBER working paper with that title, by S. Borağan Aruoba and Jesus Fernandez-Villaverde. Here is the abstract:

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes in a Mac and in a Windows computer and comment on the strength and weakness of each language.
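For readers unfamiliar with the method, the sketch below shows value function iteration with grid search in Python for a stripped-down, deterministic version of the growth model (log utility, no productivity shocks, illustrative parameter values). It is not the paper’s calibration or code – only an illustration of the algorithm every language was asked to run.

```python
import numpy as np

alpha, beta, delta = 0.33, 0.96, 0.1        # capital share, discount factor, depreciation

# Center the capital grid on the steady state implied by the Euler equation
k_ss = ((1 / beta - 1 + delta) / alpha) ** (1 / (alpha - 1))
k_grid = np.linspace(0.5 * k_ss, 1.5 * k_ss, 200)

V = np.zeros(k_grid.size)                   # initial guess for the value function
policy = np.zeros(k_grid.size, dtype=int)   # index of the chosen next-period capital

for _ in range(2000):
    V_new = np.empty_like(V)
    for i, k in enumerate(k_grid):
        # Consumption implied by each candidate choice of next-period capital k'
        c = k ** alpha + (1 - delta) * k - k_grid
        u = np.where(c > 0, np.log(np.where(c > 0, c, 1.0)), -1e10)  # penalize infeasible choices
        candidates = u + beta * V
        policy[i] = np.argmax(candidates)                            # grid search over k'
        V_new[i] = candidates[policy[i]]
    if np.max(np.abs(V_new - V)) < 1e-6:                             # contraction has converged
        break
    V = V_new

print("value at the grid point nearest the steady state:", V[np.argmin(np.abs(k_grid - k_ss))])
```

The nested loop over the grid is exactly the kind of tight numerical loop where compiled languages, JIT compilers such as Julia’s or Numba, and vectorization make the differences the authors measure.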

Here are their results:

1. C++ and Fortran are still considerably faster than any other alternative, although one needs to be careful with the choice of compiler.

2. C++ compilers have advanced enough that, contrary to the situation in the 1990s and some folk wisdom, C++ code runs slightly faster (5-7 percent) than Fortran code.

3. Julia, with its just-in-time compiler, delivers outstanding performance. Execution speed is only between 2.64 and 2.70 times the execution speed of the best C++ compiler.

4. Baseline Python was slow. Using the PyPy implementation, it runs around 44 times slower than C++. Using the default CPython interpreter, the code runs between 155 and 269 times slower than C++.

5. However, a relatively small rewriting of the code and the use of Numba (a just-in-time compiler for Python that uses decorators) dramatically improves Python’s performance: the decorated code runs only between 1.57 and 1.62 times slower than the best C++ executable (a minimal illustration of the decorator style appears after this list).

6. Matlab is 9 to 11 times slower than the best C++ executable. When combined with Mex files, though, the difference is only 1.24 to 1.64 times.

7. R runs between 500 and 700 times slower than C++. If the code is compiled, it is between 240 and 340 times slower.

8. Mathematica can deliver excellent speed, about four times slower than C++, but only after a considerable rewriting of the code to take advantage of the peculiarities of the language. The baseline version of our algorithm in Mathematica is much slower, even after taking advantage of Mathematica compilation.
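Point 5 refers to Numba’s decorator-based just-in-time compilation. The following is a minimal illustration of that style (not the authors’ code), assuming the open source numba and numpy packages are installed; the function is an invented stand-in for the loop-heavy inner step of value function iteration.

```python
import numpy as np
from numba import njit

@njit   # the decorator compiles this loop-heavy function to machine code on first call
def bellman_row(utilities, beta, V):
    best = -1e18
    for j in range(V.shape[0]):
        candidate = utilities[j] + beta * V[j]
        if candidate > best:
            best = candidate
    return best

V = np.random.rand(200)
utilities = np.random.rand(200)
print(bellman_row(utilities, 0.96, V))
```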

There are ungated copies and some discussion here.

 

 

Slowly, slowly, the various pieces to build out the Hadoop vision are coming together

27 Jun

This article talks about Hunk (Splunk + Hadoop). It is a good example of how the various pieces are coming together to enable the Hadoop vision to become reality. Where will the mainstream commercial off-the-shelf folks be? Many of the mainstream data vendors are not moving their code into Hadoop (as Hunk is), but rather moving extracts into their own machines.

There are some folks who believe that the COTS products will win in the end, as they are going to reduce the cost to operate – the total cost of ownership – to the point that it does not matter how “free” the open source stuff is. On the other hand, there are companies going the other way and creating non-traditional support models around open source software. This started with Red Hat. However, Red Hat still has the same approach to licensing – no matter what you use in terms of support, you still pay the license fee – and that sounds like Oracle redux to me. I have seen that in a number of other open source product vendors as well. The new trend may be to support open source tools with a “pay for what you use” type menu. We will see how that goes. In the meantime, who names these products?

Efficiencies with Columnar Databases

10 Jun

A little while ago, I posted an entry from Andrew’s Crabtree Analytics blog, where he had posted data from a test showing significant performance improvements related to the use of columnar data structures when doing large joins. He has recently updated the test with significantly larger data volumes, and the results are posted here – still a big improvement.

 

 

TIBCO – Buys another company

1 Apr

TIBCO buys another company in the analytics space. I have always thought that with Spotfire, the Enterprise Service Bus business, and the acquisition of Insightful some years ago, TIBCO had the makings of a company that was putting together the Big Data analytical stack. With this purchase, they have added a geo capability. Will they ever get all these pieces integrated to create a solutions package – like SAS’s Fraud Framework? Not sure why they have not done that to date. It may just be that it is too hard to sell complete solutions, and it is easier to get in the door with a point solution. Anyway – I like Spotfire, and anything they do to build out the back end is good stuff. The price point still seems a little high for a point solution, but they seem to be making it work for them, so who am I to argue… it will be interesting to see how this plays out.

See also here – they appear in the MDM Magic Quadrant as well.

Magic Quadrant for Data Integration Tools

23 Feb

Gartner Data Integration Survey

October 2012 – all the usual suspects. However, I was surprised (and pleased) to see Talend in the mix. Interesting to note that SAS is in the lead with the number of installs (13k), up there with Microsoft (12k).

Link

Microsoft Powerpivot

17 Feb


I have always thought that Tableau was initially just a cool way to create Excel-style pivot tables and present the results in a graphic – something you can do in Excel, but which is a lot easier in Tableau. Is PowerPivot the MS answer to these tools that have leveraged Excel, but have not used the Excel visualization capabilities or been willing/able to write the VBA code to get Excel to do what you want it to do?
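For what it is worth, the same pivot-and-aggregate step can be done outside Excel with a few lines of pandas. The column names below are invented for the example; this is only a rough analogue of what a pivot table does, not a description of PowerPivot itself.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100.0, 80.0, 120.0, 60.0, 90.0],
})

# Equivalent of an Excel pivot table: regions down the side, products across the top
pivot = sales.pivot_table(index="region", columns="product",
                          values="revenue", aggfunc="sum", fill_value=0)
print(pivot)
```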

I do not have Office 2013, but look forward to playing with this when I do.

Link

Another reason why Data Management and Analysts cannot lead separate lives

14 Feb


I found this article interesting in that it points out why the bridge between the data side of the house and the analytical side must be well established – if the data team implements a design that does not support analytics, it has material impacts. I know this is blindingly obvious, but ….

I have recently been in a number of discussions where the attitude was: we are going to build the data warehouse using best practices and years of experience, and it really does not matter what you are going to do with the data. I know it sounds crazy, but… you know what I am talking about – we see it all the time.

The article itself tests the performance of a columnar versus a relational approach to persisting data, and has some surprising results – a 4,100% improvement! I would be interested in other studies that have looked at the differences between data architectures when performing analytical tasks.
