
Delivering Enterprise Analytics

This article broadly describes the capabilities that constitute an enterprise analytics program or competency. The initial intention was to provide tips on mitigating the challenges encountered in implementing an analytics practice, but that has been relegated to a future article.

IT projects in general, and analytics projects in particular, are notoriously unsuccessful or “challenged”.

Focusing attention on the following short list before embarking on an analytics project or undertaking will help mitigate many of the challenges and obstacles encountered when delivering value through analytics.

  • clearly articulate the strategic (business) objectives of the solution/project at the outset
  • understand, profile and document the origins and sources of data before developing any data transformation pipelines
  • establish, and contract with data source providers where necessary, to ensure that new and changed data can be identified and extracted incrementally (see the sketch after this list)
  • document the important questions and insights relevant to the domain/business processes
  • identify the skills required to plan, build, evolve and operate the analytics landscape
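
On the incremental-extraction point in the list above, the sketch below shows one common watermark-based approach, assuming a hypothetical source table with a last_modified column; all table, column and state names are illustrative rather than prescriptive.

```python
# Watermark-based incremental extraction (illustrative only: the table,
# column and state names are hypothetical).

def extract_incremental(conn, watermark):
    """Return only rows modified after the last successful extract."""
    cursor = conn.execute(
        "SELECT id, payload, last_modified FROM source_table "
        "WHERE last_modified > ?",
        (watermark,),
    )
    return cursor.fetchall()

def run_extract(conn, state):
    rows = extract_incremental(conn, state["watermark"])
    if rows:
        # Advance the watermark to the newest change seen, so the next run
        # only picks up data that arrived after this one.
        state["watermark"] = max(row[2] for row in rows)
    return rows

# Usage (e.g. with sqlite3):
# run_extract(sqlite3.connect("ods.db"), {"watermark": "1970-01-01"})
```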

The guiding intentions for project leaders and owners should be to:

  • achieve high value for end users
  • minimise latency and maximise the availability of information
  • offer ease of use and high-performance data processing.

These outcomes depend on several factors, and achieving them requires implementing and orchestrating some or all of the core capabilities listed below.

BI and Analytics Capabilities

  • BI Organisation and Processes
  • Data Pipelines / ETL Platform
  • Operational Data Store / Data Lake
  • Conformed Data Warehouse
  • Appliance / Aggregation Engine
  • Visualisation Tool/s
  • Statistical Processing
  • Machine Learning / AI Processing

Some of these capabilities may be bundled into a single platform or offering; recent trends have seen aggregation layers and visualisation capabilities, in particular, converge into the same tool or product.

Organisations are unlikely to focus attention on all of these capabilities at the same time and should plan to mature their BI practice gradually until they have effective practices across all of these competencies.

1) Analytics Organisation and Processes

This is the most important consideration for any enterprise that is serious about building a sustainable analytics competency.

Skills cannot merely be purchased: collegial, innovative teams are not available for sale and need to be built and matured. Many organisations fall into the trap of procuring skills and expertise while neglecting to invest in team formation. They end up commoditising skills, ultimately to the detriment of sustainable success.

A great deal of attention is sometimes paid to the tools and technology to be employed, but not to how these products will be integrated into the operational fabric of the enterprise and how they will affect the daily lives of the producers and consumers of analytical solutions.

The key profiles involved in the analytics value chain are broadly:

  • Consumers: People who use data and information to make decisions, inform opinions and generate ideas and insights. This includes business users, suppliers, customers, or even the broader public.
  • Analysts: People who interpret requirements and conceive solutions that are consumed by users. This group includes data scientists, actuaries, statisticians, economists, etc.
  • Data Engineers: People who prepare and manage data and databases; including data pipelines.
  • Visualisation Engineers: People who develop interfaces and reports to present data and information. These folk are creative and artistic with a flair for understanding data and information.

There are usually a number of other roles and stakeholders involved in a typical BI practice, but the above are the primary role players that need to be organised to achieve high quality delivery.

Once these role players are identified and organised into a team (or teams), it is essential that there is seamless communication between all parties. Requirements and outcomes must be driven by Consumers; this may require facilitation by experienced analysts to formulate the briefs describing the solutions.

Delivery and development teams must provide feedback daily, and new functionality should be released within a timespan of days, never longer.

2) Data Pipelines

A core capability in any analytics practice is the data pipeline subsystem, traditionally referred to as ETL. Ideally, the ETL system spans the organisation and is described and managed in a repository that contains all of the published, productionised integration rules enabling communication between applications. Another component of the ETL subsystem is the engine that executes and coordinates integration jobs and processes, recording and collecting performance and execution statistics.
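
As a rough illustration of the repository-plus-engine idea, here is a minimal sketch in Python; the class and field names are assumptions made for the example, not part of any particular ETL product.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: a tiny job registry and runner standing in for the
# repository of integration rules and the execution engine described above.

@dataclass
class JobRun:
    name: str
    seconds: float
    rows: int
    succeeded: bool

class PipelineRepository:
    def __init__(self) -> None:
        self.jobs: dict[str, Callable[[], int]] = {}   # job name -> callable returning rows moved
        self.history: list[JobRun] = []                # collected execution statistics

    def register(self, name: str, job: Callable[[], int]) -> None:
        self.jobs[name] = job

    def run_all(self) -> None:
        for name, job in self.jobs.items():
            start = time.perf_counter()
            try:
                rows, ok = job(), True
            except Exception:
                rows, ok = 0, False
            self.history.append(JobRun(name, time.perf_counter() - start, rows, ok))
```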

An increasingly important function of the ETL subsystem is to operationalise and automate the testing and quality controls that identify issues and continuously measure data quality, triggering alerts and warnings when quality thresholds are breached.
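
A minimal sketch of such an automated quality gate, assuming illustrative column names and thresholds:

```python
# Illustrative data-quality gate: measure null rates against thresholds and
# raise an alert when a threshold is breached. Column names and limits are
# assumptions made for the example.

def null_rate(rows, column):
    values = [row.get(column) for row in rows]
    return sum(value is None for value in values) / max(len(values), 1)

def check_quality(rows, thresholds, alert):
    """thresholds: {column: max_allowed_null_rate}; alert: callable(str)."""
    for column, max_rate in thresholds.items():
        rate = null_rate(rows, column)
        if rate > max_rate:
            alert(f"Null rate for {column} is {rate:.1%}, above the {max_rate:.1%} threshold")

# Example usage:
# check_quality(extracted_rows, {"customer_id": 0.0, "email": 0.05}, print)
```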

3) Operational Data Store / Data Lake

Implementing a conformed dimensional model that spans all the subject areas related to an enterprise's activities used to be the holy grail of BI practice. Achieving this ideal, however, has proved so fraught with challenges that it is quite customary to hear that the majority of BI projects fail.

Implementing an enterprise dimensional model is indeed costly and complex for most organisations.

A simpler initiative is often to implement an operational data store (ODS); this is not a substitute for the dimensional data warehouse but an additional, more basic requirement.

An ODS is a database or platform that sources and stores data from line-of-business systems, external systems, public data feeds and social media platforms. Little or no transformation is applied when data is ingested into the ODS, although formats and structures are sometimes changed, for example to load data from an object-oriented source into a relational database.
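
A light-touch ingestion step of this kind might look like the sketch below, which flattens nested (object-like) records into relational-style rows; the field names are invented for illustration.

```python
import json

# Sketch of light-touch ODS ingestion: flatten nested source records into
# flat, relational-style rows with no business transformation applied.

def flatten(record, parent_key="", sep="_"):
    """Flatten a nested dict into a single-level dict of column -> value."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

raw = json.loads('{"id": 1, "customer": {"name": "Acme", "region": "EMEA"}}')
print(flatten(raw))   # {'id': 1, 'customer_name': 'Acme', 'customer_region': 'EMEA'}
```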

The purpose of an ODS is to provide a single access point and system where all data required for analytics projects is available. The ODS will be used for ad hoc data investigation, reporting and analysis, and prototyping analytics pipelines, but may very well have the primary function of being the source for a more structured data warehouse.

ODS systems are usually relational databases but can, in theory, be any platform capable of storing data and allowing quick search and retrieval. Accommodating a large volume of data in a single platform or system can be expensive, and the requirements for relational ODS databases (in both storage and processing capacity) can become practically unsustainable.

An evolution of the relational ODS database is the data lake, which serves the same functional purposes but is implemented on a big data (Hadoop) cluster.

Key features of an ODS / data lake

  • Accommodate a variety of data formats and structures
  • Ingest data quickly
  • Retrieve data quickly
  • Extend storage capacity quickly and cheaply
  • Provide flexible security and protection of data
  • Export data to multiple target platforms/systems

Because the process of ingesting a variety of data sources into an ODS or data lake is often chaotic, navigating this data can be difficult and unwieldy. The challenge can be mitigated by rigorous, high-quality metadata management. Simple and intuitive logical organisation of information assets is necessary to enable effective use of the ODS environment, lest it devolve into a “data swamp”.
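
One minimal form such metadata management can take is a simple catalogue of dataset entries, as sketched below; the fields and example values are assumptions, not a prescribed standard.

```python
from dataclasses import dataclass, field

# Minimal sketch of a metadata catalogue that keeps a data lake navigable.
# The fields and sample values are illustrative assumptions.

@dataclass
class DatasetEntry:
    name: str
    source_system: str
    owner: str
    refresh_schedule: str
    description: str
    tags: list[str] = field(default_factory=list)

catalogue = [
    DatasetEntry("sales_orders_raw", "ERP", "data-engineering", "daily",
                 "Unmodified order extracts from the ERP", ["sales", "raw"]),
]

def datasets_tagged(catalogue, tag):
    """Let analysts discover assets by tag instead of trawling the lake."""
    return [entry.name for entry in catalogue if tag in entry.tags]

print(datasets_tagged(catalogue, "sales"))   # ['sales_orders_raw']
```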

Monitoring ODS usage patterns provides guidance and specifications for future BI projects, and informs the requirements for developing and enhancing more highly governed solutions.

4) Conformed Data Warehouse

There has always been debate about whether a conformed (Kimball) data warehouse needs to be a feature of enterprise analytics landscapes. The primary rationale for rigorously defining and implementing dimensions and facts is that they simplify reporting and analysis for analysts and business users. The lived reality is that implementing a conformed data warehouse is usually costly and complex, and may require several false starts.

The alternatives, however, are hopelessly prone to result in a chaotic constellation of data artefacts that are difficult to use; there is hardly any mention these days of the information factory and Inmon approach to data warehousing. The debate today is rather about whether to invest in a dimensional data warehouse at all or to merely collect data into data lakes. There are many useful comparisons of the two dominant approaches to data organisation in a data warehouse, for instance http://tdan.com/data-warehouse-design-inmon-versus-kimball/20300.

In my opinion, the circumstances in which a conformed, dimensionally modelled data warehouse can be dispensed with are rare. Apart from removing complexity from information analysis and consumption, a well-constructed data warehouse helps organisations evolve a shared language for communicating about information and analytics, and empowers them to become analytics-enabled, data-driven businesses.

A conscientiously designed and implemented dimensional data warehouse will probably always be a dominant feature of a successful analytics practice.
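
To make the rationale for dimensions and facts concrete, the sketch below (in pandas, with invented contents) shows a tiny star schema: a fact table of measures joined to conformed dimensions so that analysts can slice by any shared attribute.

```python
import pandas as pd

# Tiny star schema sketch: one fact table of measures keyed to conformed
# dimensions. All table contents are invented for illustration.

dim_date = pd.DataFrame({
    "date_key": [20170101, 20170102],
    "calendar_date": ["2017-01-01", "2017-01-02"],
    "month": ["Jan", "Jan"],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})
fact_sales = pd.DataFrame({
    "date_key": [20170101, 20170102, 20170102],
    "product_key": [1, 1, 2],
    "quantity": [10, 4, 7],
    "revenue": [100.0, 40.0, 84.0],
})

# Because the dimensions are conformed, any subject area's facts can be sliced
# by the same attributes with simple joins.
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_product, on="product_key")
          .groupby(["month", "category"], as_index=False)[["quantity", "revenue"]]
          .sum())
print(report)
```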

5) Aggregation Layer

The performance bottleneck is the bane, even sworn enemy, of engineers and vendors of every computing system. Rapid and exponential improvements in data processing speed have never succeeded in meeting the demands and expectations of developers and users. By their nature, analytics and BI applications require large data storage and processing capacity. The storage challenge is usually comfortably addressed, but the processing demands often require stopgap approaches.

In a perfect scenario, granular data would never need to be materialised in pre-aggregated forms as any calculation or summary would be performed on demand. This goal of analytics enablement has proved so elusive that various mitigating approaches are employed to overcome computing constraints.
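
The most basic of these mitigations is to materialise aggregates ahead of time, roughly as sketched below with invented data: summarise the granular rows once, then answer repeated queries from the much smaller summary.

```python
from collections import defaultdict

# Minimal sketch of pre-aggregation: build a materialised summary once
# (e.g. after each load) and answer repeated queries from it rather than
# re-scanning the granular detail. Data values are invented.

granular_sales = [
    {"region": "EMEA", "month": "Jan", "revenue": 100.0},
    {"region": "EMEA", "month": "Jan", "revenue": 40.0},
    {"region": "APAC", "month": "Jan", "revenue": 84.0},
]

def materialise(rows):
    summary = defaultdict(float)
    for row in rows:
        summary[(row["region"], row["month"])] += row["revenue"]
    return dict(summary)

aggregate = materialise(granular_sales)   # built once
print(aggregate[("EMEA", "Jan")])         # answered from the aggregate: 140.0
```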

The most commonly used and successful remedies to the constraints imposed by insufficient CPU and main memory capacity are OLAP (multi-dimensional) and columnar databases. OLAP databases are particularly successful in environments where they are fed and supported by dimensional models. This configuration, in which OLAP databases provide fast user access to aggregated data from dimensional data warehouses, has been a standard model in BI delivery for most of the late 20th and early 21st century. Column-compressed (tabular) databases were highly touted circa 2005 but faded from the hype spectacularly before re-emerging nearly ten years later as a strong contender to OLAP databases. The choice today between the two types of aggregation / analytical data layer is largely immaterial, as both formats deliver comparable performance.
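
As a loose illustration of why the columnar approach helps, the sketch below uses Parquet via pyarrow as a stand-in for a columnar analytical store: data is laid out column-wise, so an aggregate query reads only the column it needs. The data values are invented.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Columnar layout illustrated with Parquet: store data column-wise, then read
# and aggregate only the columns a query touches.

table = pa.table({
    "region": ["EMEA", "EMEA", "APAC", "AMER"],
    "product": ["Widget", "Gadget", "Widget", "Widget"],
    "revenue": [100.0, 40.0, 84.0, 63.0],
})
pq.write_table(table, "sales.parquet")

# Column pruning: only the 'revenue' column is read from disk for this query.
revenue = pq.read_table("sales.parquet", columns=["revenue"])["revenue"]
print(pc.sum(revenue).as_py())   # 287.0
```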

There are instances where the choice of aggregation technique is of consequence: OLAP platforms have certain computational capabilities baked in, such as comparison and aggregation by time slices, no doubt inspired by finance departments' eager early adoption of OLAP-based reporting.

Column-oriented analytical databases are growing in favour and utility and are proving well suited to a wide variety of data sets.

6) Visualisation

Competition between BI tool vendors and platforms is fought largely in the realm of visualisation capabilities.

Consumers of BI tools often seek an all-in-one platform, with both data processing and visualisation features. The variety of data subject matter, however, far outstrips the pace of visualisation innovation in commercial products. This gap is filled largely by open-source libraries on platforms and languages that offer more flexibility and more opportunity for creative, even artistic, expression in presenting information. Commercially available visualisation products are becoming increasingly difficult to differentiate; nearly all provide in-memory column compression coupled with responsive drag-and-drop charting. The likely next frontier will be easy-to-use capabilities for extending the range of charting possible.
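
A small example of the open-source route, using matplotlib with invented figures, shows how little code a bespoke chart requires once a general-purpose plotting library is in play:

```python
import matplotlib.pyplot as plt

# Minimal open-source charting sketch; the data values are invented.

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, revenue, color="steelblue")
ax.set_title("Revenue by month")
ax.set_ylabel("Revenue (thousands)")
fig.tight_layout()
fig.savefig("revenue_by_month.png")
```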

7) Statistical Processing

The application of statistical processing in BI and analytics has traditionally been dominated by a few expensive proprietary platforms, and the hardware requirements for statistical processing applications were also prohibitive. The introduction of the R platform and language to the data scientist's arsenal has completely democratised access to statistical analysis. It is now relatively easy to perform regression, clustering and even text analytics in R. The capabilities offered in R continue to evolve and expand rapidly, and possibly eclipse those available in commercial platforms.
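
The section above names R; purely as a rough analogue in Python (scikit-learn), the sketch below fits a regression and a clustering on synthetic data, just to show how routine these tasks have become.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic data: two predictors with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ordinary least squares regression recovers the coefficients.
model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)          # approximately [3.0, -1.5]

# K-means clustering on the same feature matrix.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```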

8) Machine Learning / AI

AI has moved quickly from being a cutting-edge discipline to being widely adopted in practically every sphere of computing. In the domain of analytics and BI, machine learning and AI techniques are at the forefront of extracting hidden trends and patterns from large volumes of apparently chaotic data. As with statistical processing, the explosion in the application of AI is being driven largely by open-source tools, primarily TensorFlow and Keras.
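
As a minimal illustration of the Keras API mentioned above, the sketch below trains a tiny classifier on synthetic data; the architecture and parameters are arbitrary choices made for the example.

```python
import numpy as np
import tensorflow as tf

# Tiny Keras classifier on synthetic data, purely to show the shape of a
# typical workflow: define, compile, fit, evaluate.

X = np.random.rand(500, 10).astype("float32")
y = (X.sum(axis=1) > 5).astype("int32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```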