The Need For Processes

Across industries, a growing number of organizations are moving to data-driven decision making. Some of these decisions rest on historical data and trends, whereas others, usually the more critical ones, are made on the fly from point-in-time data. Some must contend with data that changes constantly, whereas others are repetitive and based on data that does not change often.

Our focus in this article is on decisions that are made on a recurring basis, through processes that are repetitive in nature. Specifically, we look at processes that can be automated and re-run with minimal manual intervention. Data-driven engines that automate decision making or generate insights are typically referred to as ‘models’, and they can be statistical or heuristic in nature. We will talk about the differences in later sections.

A typical automated model development process involves specific sequential modules, namely:

  • Determining the objective and impact
  • Deciding on the ‘tech-stack’
  • Data Analysis and Treatment
  • Statistical or Heuristic Modelling
  • Automation
  • Security and Governance

In this article, we talk about some best practices around large scale models. We start with the importance of data and the decisions that are contingent on it. Moving on, we elaborate a little on the relevant technologies, as well as some processes that are integral to data curation, treatment and analysis. The article goes on to discuss modeling techniques for different business cases and ends with the underrated topics of security and governance of the tools.

Data – The Almighty

In today’s times of data-driven analytics and decision making, all decisions, from technology through to execution, are contingent on data. Data is the string that binds the actions of the designer, the application programmer, the analyst and the end user. The relevance of the data to the decisions made while building a large-scale model determines whether the process is redundant or useful, and whether it is sustainable. The right data, used in the right context, can prove to be exceptionally powerful.

Building a large-scale, sustainable and scalable model depends on a string of consecutive and interdependent submodules. The first brick in the wall is the choice of the database. Some of the key aspects to keep in mind while thinking about the ideal database are:

  • What is the size of the data to be handled at once?
  • What is the redundancy cycle of the data?
  • What is the refresh cycle of the data?
  • Does the data get replaced periodically, or does it keep getting appended?
  • What is the acceptable runtime of the process in question?
  • Is it a real time or a batch process?

Choosing the right technology

Selecting a technology does not mean selecting a solitary component. Rather, it means selecting a conglomeration of isolated modules that culminate in a successful final product. Taken together, these modules are called the “tech-stack”. A tech-stack has various components:

  • Data Storage and Maintenance Component
  • Data Absorption Component
  • Data Presentation Component
  • Batch Processing and Periodic/ Scheduled runs layer
  • Security and Governance Component

The first, second and fourth components listed above are the ones relevant to Large Scale Model Building and Machine Learning, and they are by far the most challenging parts of the stack.

Data Storage and Maintenance – an overview of the most popular choices available:

Choosing the right database is the first step to a successful tech-stack. There are certain rules of thumb for selecting the right one; however, there are always the adventurous, who take pleasure in the offbeat. The rules of thumb are:

  • Start with the end in mind
  • Choose the right data model
  • Disks are fast, and memory is faster
  • Both reads and writes need to be kept in mind

Based on the above four rules, certain storage systems and databases are more widely used. There are many options today; however, the more popular file systems and database types are:

  1. Hadoop Distributed File System (HDFS): Scalable, Java-based distributed file system for large-scale storage on inexpensive commodity hardware
  2. Amazon Simple Storage Service (S3): Cloud-based, scalable object storage service, widely used in Big Data tech stacks
  3. IBM General Parallel File System (GPFS): Clustered, high-performance file system from IBM


  1. General purpose databases: Postgres, MySQL, MongoDB, MSSQL
  2. Distributed table-oriented databases: Cassandra, DynamoDB, HBase
  3. Massively parallel processing databases: Impala, SparkSQL, BigQuery, Redshift, Hive, Presto

From Data Processing to Distributed Batch Processing to Streaming to Real Time Models:

It all began with Hadoop. Hadoop is an open source distributed data management system. There are two major versions, Hadoop 1.0 and Hadoop 2.0, which differ considerably. MapReduce is the processing model in Hadoop, based on Google’s paper on efficient large scale data management and processing. In Hadoop 1.0, MapReduce does the work of deciding cluster resources, allocating work, and finally mapping and reducing the results.

However, as the scale of data kept rising, even MapReduce proved inefficient, especially when you must create a mathematical model that involves a lot of intermediate steps and transformations on the data. Hence the requirement for separate distributed resource management. This is where YARN, introduced in Hadoop 2.0, comes in.

For streaming, Hadoop 2.0 can be layered with computation frameworks like Storm and Giraph to approach real-time performance.

Your Technology Stack – Analysis in Real Time – when your models are giving results in real time:

With the demand for real time analytics rose the need for even greater efficiency in memory and cluster management. MapReduce, though efficient for batch work, is not fast enough for real time. The primary reason is that every single transformation is written to disk. The in-memory approach of the Apache Spark framework was built to deal with these problems of MapReduce.

If you are creating a technology stack around Hadoop and real time analytics is required, then a scalable in-memory layer should be included alongside the CEP layer (Complex Event Processing, a framework that collects data from multiple sources). The technology stack then has the following structure:

  • Hadoop MapReduce will batch your processes
  • Storm will process in real time

Start with a clear understanding of the business requirement. If all the business requirements can be met with a semi-supervised adaptive algorithm that runs periodically and has a business aspect (intervention) to it, then you might consider the above structure, which is cost effective and meets your needs. Twitter uses Storm for its purposes.

There is another framework called Giraph (just replace Storm above with Giraph). Giraph is built for graph processing. Facebook uses Giraph, with some performance improvements, to analyze one trillion edges using 200 machines in just 4 minutes.

Data Processing – Iterative Machine Learning – overcoming the bottlenecks using the right Technology – Spark is particularly good:

Almost all machine learning algorithms are based on iterations over the same data. In the Big Data framework, this translates to creating a lot of intermediate data sets on the go. The parallel framework of MapReduce is not sufficient to take the burden of iterations and efficient memory management. Apache Spark, on the other hand, manages this quite efficiently: the Resilient Distributed Dataset (RDD) comes to the rescue, since each data set can be visited multiple times in a loop. In the machine learning context, Spark does this by caching the intermediate datasets in memory and then performing the iterations.
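To make the benefit of caching concrete, here is a minimal pure-Python sketch. It does not use Spark: `load_from_disk` is a hypothetical stand-in for an expensive read, and the tiny data set and learning rate are assumptions chosen for illustration. It only shows the idea that an iterative algorithm which re-reads its input on every pass pays the read cost once per iteration, while a cached copy is read once.

```python
# Pure-Python sketch of why caching helps iterative algorithms.
# load_from_disk is a hypothetical stand-in for an expensive read; it is not a Spark API.

disk_reads = 0


def load_from_disk():
    """Pretend to read the training data from disk and count each read."""
    global disk_reads
    disk_reads += 1
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs with y = 2x


def fit_slope(get_data, iterations=50, learning_rate=0.05):
    """Fit y = w * x by gradient descent, fetching the data once per iteration."""
    w = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * gradient
    return w


# Without caching: every iteration goes back to "disk".
w_uncached = fit_slope(load_from_disk)
reads_uncached = disk_reads

# With caching: read once, keep the data in memory, iterate over the cached copy.
disk_reads = 0
cached = load_from_disk()
w_cached = fit_slope(lambda: cached)
reads_cached = disk_reads

print(reads_uncached, reads_cached)
```

Both runs arrive at the same slope; the only difference is how many times the data source is hit, which is the cost that in-memory caching is meant to remove.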

Most data storage and management across the world is built on Hadoop. The challenge arises when, due to emerging business needs, an organisation wants to upgrade its database management to facilitate more real time algorithms. We at Affine have converted Hive-based legacy systems to a Spark environment and then built models on top of it.

For the reader’s interest and reference, here is how to manage your stack based on your model building requirements:

Data Processing and Management is an integral part of any Large-Scale Model Building and Maintenance. Ask yourself the following questions:

  • Plan your data processing requirements based on your algorithm and business requirement. Is batch processing enough? Or is your algorithm closer to real time, like a recommendation engine, or do data updates happen in real time, like POS data?
  • If you are strictly real time, the first thing your technology stack should know is the upper bound for the size of the data streaming into your system. Factor in 10 times that size.
  • What is the expected size of the data in one iteration of your model? Extrapolate to find the total size, and factor in 10 times that.
  • Does model performance depend on a larger number of iterations (more iterations to train on to get better and better performance), or is the model less steeped in iterations and driven more by business decisions?
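The sizing questions above boil down to simple arithmetic. The sketch below is one way to write it down; the function name `plan_capacity` and the example figures (2 GB per iteration, 30 iterations) are assumptions made purely for illustration.

```python
def plan_capacity(bytes_per_iteration, iterations, safety_factor=10):
    """Extrapolate the total data size and apply the 10x factor from the checklist."""
    expected_total = bytes_per_iteration * iterations
    return expected_total * safety_factor


GB = 1024 ** 3

# Hypothetical numbers: 2 GB of data per iteration, 30 iterations.
required = plan_capacity(2 * GB, 30)
print(f"Plan for about {required / GB:.0f} GB")
```

With these assumed inputs the plan comes to about 600 GB; the useful part is having the factor in one place so it can be revisited as the data grows.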

Data Handling

Once the tech-stack is finalized, and the data is available, we move into the development phase of the process. There are various subparts to the development phase:

EDA and AD creation

EDA (exploratory data analysis) involves looking at the data from many different angles: slicing and dicing the data along non-trivial, non-orthogonal dimensions and combinations of dimensions, transforming the data through nonlinear operators, projecting the data onto a different subspace, and then examining the resulting distribution.

  • Online Analytical Processing (OLAP)

This is also known as fast analysis of shared multidimensional information (FASMI). It allows the end users of multidimensional databases to generate computed summaries of the data and to run other systematic queries. Despite the name, OLAP analyses do not have to run in real time; the term simply refers to analyzing multidimensional databases.

  • Traditional EDA

We can do the EDA using SQL or programming languages. Every big data store comes with a host of applications built on top of it that enable us to do EDA in SQL or other query languages.

  • Neural Networks

Neural networks are one of the data mining techniques. They are analytical techniques inspired by biological nervous systems, such as the brain, and the way they process information. Modeled on the functions of the brain, a neural network can predict new observations from other observations (i.e., from one variable to another) through a process of learning from existing data.
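As a small illustration of the SQL-based EDA mentioned under Traditional EDA above, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns and its rows are all hypothetical; on a real stack the same kind of queries would run against whichever SQL layer sits on top of the data store.

```python
import sqlite3

# In-memory database with a small, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, units INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "A", 10, 100.0),
        ("North", "B", 5, 75.0),
        ("South", "A", 8, 80.0),
        ("South", "B", 12, 180.0),
    ],
)

# Slice along one dimension: totals per region.
by_region = conn.execute(
    "SELECT region, SUM(units), SUM(revenue) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Dice along a combination of dimensions: price per unit by region and product.
by_region_product = conn.execute(
    "SELECT region, product, SUM(revenue) / SUM(units) AS price_per_unit "
    "FROM sales GROUP BY region, product ORDER BY region, product"
).fetchall()

for row in by_region:
    print(row)
for row in by_region_product:
    print(row)
```

Looking at the same measure along one dimension and then along a combination of dimensions is often enough to surface patterns, such as product B being priced higher in both regions in this toy data.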

AD creation

  • Variable identification
  • Variable Modification
  • Calculated fields
  • Variable transformation

Following these steps, we can create our AD in the desired format. We can use programming languages or SQL to create the AD.
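The four steps listed above can be sketched in a few lines of Python. Everything here is hypothetical: the column names, the sample rows and the helper `build_ad_row` are assumptions for illustration, and the specific choices (trimming and lower-casing a label, a log transform) are just examples of each step.

```python
import math

# Made-up raw records; "notes" is a column we decide not to carry forward.
raw_rows = [
    {"customer_id": 1, "revenue": 1200.0, "orders": 4, "segment": "retail", "notes": "x"},
    {"customer_id": 2, "revenue": 300.0, "orders": 0, "segment": "RETAIL ", "notes": ""},
    {"customer_id": 3, "revenue": 9000.0, "orders": 10, "segment": "wholesale", "notes": ""},
]

# Variable identification: the columns that go into the AD.
KEEP = ("customer_id", "revenue", "orders", "segment")


def build_ad_row(row):
    """Turn one raw record into one AD row."""
    out = {name: row[name] for name in KEEP}
    # Variable modification: normalise inconsistent labels.
    out["segment"] = out["segment"].strip().lower()
    # Calculated field: average order value, guarding against zero orders.
    out["avg_order_value"] = out["revenue"] / out["orders"] if out["orders"] else 0.0
    # Variable transformation: log scale to tame the skew in revenue.
    out["log_revenue"] = math.log1p(out["revenue"])
    return out


ad = [build_ad_row(row) for row in raw_rows]
for row in ad:
    print(row)
```

Keeping each step explicit in one function makes it easier to automate the AD creation later, because every rule that touches the data is in one place.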

The creation of a robust AD is of paramount importance when it comes to the modeling process. Since the entire automated framework is sequential in nature, all subsequent steps run the risk of collapsing, if the AD creation is not robust.


Model Development

Once the data has been explored, cleaned and the AD is prepared, it is time to move into the actual modeling phase. Now for the purposes of this article, since the actual technique and the results are not relevant, we will not delve deep into the gory details. However, just to touch upon some salient features of the modeling phase, the first decision is whether the model is statistical, machine learning, or simply heuristic. The decision depends on several factors.

  • The client – Is the client comfortable with complex statistical procedures? Are the users of the final product interested in equations or just results?
  • The objective – What are we trying to predict? Is it a scoring model or a predictive model?
  • The application – Will the model be used for complex and critical procedures, or more for a directional perspective?
  • The resources – What resources are at our disposal in terms of money and manpower, specifically the skillset of the team?

Based on the above factors, the final modeling technique is selected. Multiple models can serve the same purpose; however, there is usually only one that fits both the data and the above factors. In case there are several, other factors such as ease of automation, execution time and interpretability come into the picture.

Some common programming languages that are used in today’s world for building large scale models are:


Of course, there are other open source and paid programming languages, but these are the more widely used variants. Each of these programming languages has strengths of its own that can be leveraged depending on the kind of business problem one is trying to solve, the volume of data and the frequency of refresh. If the model will be surfaced through a UI, it is wise to choose programming languages based on their compatibility with the frontend and middleware components.

Automation and Model Manipulation

The actual modeling phase is followed by the automation phase. The modeling must be a zero-dependence process, with zero manual intervention. All the stages starting from the AD creation till the production of results should be made error free, and self-sufficient. Automation of the model building process entails automation of the data treatment and AD creation phase as well as automation of the modeling phase.

Automation of the Data Handling Phase involves:

  • Error and Exception Handling: All the possible scenarios where the model might throw an error have to be handled internally.
  • Memory overload error: The limit of each node needs to be identified, and the process should be batched when the data exceeds that limit.
  • Code failure: An automated notification needs to be triggered every time there is a code failure due to data or system requirements.
  • Stress Testing: Prepare cases in which the automated jobs might fail.
  • Error code standard: The error message should be easily understandable by anyone.
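The points above can be combined in one small sketch: batch the work so that no batch exceeds a limit, catch failures, attach a standard error code with a plain-language message, and send a notification. The error codes, the `PipelineError` class and the `run` function are hypothetical names invented for this example; the notification here is just a logging call, where a real system might send an email or a chat alert.

```python
import logging

logger = logging.getLogger("pipeline")

# Hypothetical error code standard: a short code plus a plain-language message.
ERROR_CODES = {
    "E001": "Input data is empty",
    "E002": "A record has a missing or non-numeric value",
}


class PipelineError(Exception):
    """An error that always carries one of the standard codes."""

    def __init__(self, code):
        self.code = code
        super().__init__(f"{code}: {ERROR_CODES[code]}")


def batches(rows, max_rows):
    """Split rows into chunks no larger than max_rows (the assumed per-node limit)."""
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]


def process_batch(batch):
    """Sum one batch, raising a coded error on bad records."""
    try:
        return sum(float(value) for value in batch)
    except (TypeError, ValueError) as exc:
        raise PipelineError("E002") from exc


def run(rows, max_rows=1000, notify=logger.error):
    """Process all batches, notifying on each failure instead of stopping the job."""
    if not rows:
        raise PipelineError("E001")
    total = 0.0
    failures = []
    for index, batch in enumerate(batches(rows, max_rows)):
        try:
            total += process_batch(batch)
        except PipelineError as err:
            failures.append((index, err.code))
            notify(f"Batch {index} failed with {err}")
    return total, failures


# Stress case: one bad record in the middle of the data.
messages = []
total, failures = run([1, 2, "x", 4, 5], max_rows=2, notify=messages.append)
print(total, failures, messages)
```

Deliberately feeding the job bad input, as in the last lines, is a simple form of the stress testing mentioned above: the job should finish, report which batch failed, and say why in words anyone can read.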

Automation of the modeling phase involves:

  • Optimization of Model: Some models need rebuilding every time new data flows in, whereas others are scoring based and are just refreshed from time to time. One needs to ensure that the time taken for either process is feasible.
  • Iteration of models: The system should run multiple iterations of the model, and select the best model.
  • Comparison of results / Gradient Boosting: In case the process comes up with multiple acceptable or good models, rules need to be in place to select the best one.
  • Documentation: Most importantly, there should be exhaustive documentation of the rules and processes within the architecture. Where the developers are not the ones responsible for maintenance, documentation is of paramount importance.
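Running several candidate models and picking one by an explicit rule could look like the sketch below. It is not a specific method from this article: the candidates are simple moving-average forecasts, the selection rule (lowest holdout error, ties broken in favour of the smaller window) and the sample series are assumptions chosen to keep the example short.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def holdout_mae(series, window, holdout):
    """Mean absolute error of one-step-ahead forecasts over the last `holdout` points."""
    errors = []
    for t in range(len(series) - holdout, len(series)):
        forecast = moving_average_forecast(series[:t], window)
        errors.append(abs(series[t] - forecast))
    return sum(errors) / len(errors)


def select_best(series, windows, holdout=3):
    """Score every candidate; pick the lowest error, ties go to the smaller window."""
    scores = {w: holdout_mae(series, w, holdout) for w in windows}
    best = min(scores, key=lambda w: (scores[w], w))
    return best, scores


# Made-up series with a steady upward trend.
series = [10, 12, 14, 16, 18, 20, 22, 24]
best_window, scores = select_best(series, windows=[1, 2, 4])
print(best_window, scores)
```

Writing the selection rule as code rather than leaving it to judgment is what allows the process to run without manual intervention, and it doubles as documentation of why a given model was chosen.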


Model Maintenance

Predictive analytics models use data gleaned from past experiences and events to identify future risks and opportunities. Conditions and environments are constantly changing and thus need to be reflected in the models, otherwise, model performance will decay over time.

As organizations rely on predictive models on an increasingly larger scale, establishing a consistent, systematic model management lifecycle methodology is of paramount importance. A typical model management lifecycle consists of data management, modeling, validation, model deployment and model monitoring.

The frequency of monitoring a modeling process depends on:

  • Frequency of change of business environment and industry
  • Frequency of model driven decisions
  • Recency of model
  • Frequency of data refresh

The model monitoring process itself is sequential and modular. There are specific subparts that need to be developed separately and eventually combined into the bigger picture, namely:

  • Performance Metrics Monitoring: Monitoring metrics specific to the modeling process deployed.
  • Reports: User-friendly reports that get generated periodically, along with relevant flags (RAG: red, amber, green) and notifications.
  • Triggers: Notifications that the model requires recalibration, based on critical values of relevant parameters.
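The three subparts above can be tied together in a short sketch: a metric per period, a RAG flag per metric, and a recalibration trigger when the flag turns red. The thresholds (0.80 and 0.70), the period labels and the metric values are hypothetical; in practice they would come out of the dialogue with the business described in this section.

```python
def rag_flag(metric, amber_below, red_below):
    """Map a higher-is-better performance metric to a RAG flag."""
    if metric < red_below:
        return "RED"
    if metric < amber_below:
        return "AMBER"
    return "GREEN"


def monitoring_report(history, amber_below=0.80, red_below=0.70):
    """Build one report row per period and mark when recalibration should be triggered."""
    report = []
    for period, metric in history:
        flag = rag_flag(metric, amber_below, red_below)
        report.append(
            {
                "period": period,
                "metric": metric,
                "flag": flag,
                "recalibrate": flag == "RED",
            }
        )
    return report


# Made-up accuracy history showing gradual decay.
history = [("period-1", 0.86), ("period-2", 0.78), ("period-3", 0.66)]
report = monitoring_report(history)
for row in report:
    print(row)
```

This is the generic kind of trigger most models use; a more specific trigger would look at which parameter moved and by how much before asking for recalibration.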

The best way to go about this, along with careful deliberation and comprehensive knowledge of the business processes, is active dialogue with the business and stakeholders. Once the metrics are developed and the reports and dashboards around model monitoring are created, there needs to be a systematic approach to apprising the relevant stakeholders whenever there is reason for concern. More advanced automated models make these triggers exceptionally specific; however, most models take a more generic approach to flagging anomalies and deterioration.

Security and Governance

Once the product, be it a tool or an inbuilt feature of a bigger scheme of things, is built and ready to be launched, there is that icing on the cake which sometimes proves to be a crucial inclusion: “Security”. Not all information is meant for all eyes, hence the need for this layer. The inclusion of security features has distinct advantages:

  • Prevents dilution of information
  • Prevents corruption of data
  • Prevents overload of the tool
  • Keeps information constrained within the relevant audience


A security and governance layer is especially important when the results of the backend model are viewed through a web tool or handheld app, in which case the backend tech-stack is integrated with the front end User Interface (UI). In such instances, security has two aspects to it:

  1. Authorization: Defining which user will have access to which database. LDAP can be used to provide authorization to users; with LDAP, the level of access each user has to the data can be determined.
  2. Authentication: Verifying the identity of a user. Kerberos is a network authentication protocol that provides authentication for client/server applications using secret-key cryptography.
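The authorization half of this can be pictured as a lookup from user to database to access level. The sketch below is only a conceptual illustration: the users, databases and levels are hypothetical, the table is hard-coded where a real deployment would query a directory such as LDAP, and it assumes the user has already been authenticated.

```python
# Hypothetical access levels, ordered from least to most privileged.
LEVELS = {"none": 0, "read": 1, "write": 2}

# Hypothetical directory of grants; in practice this would come from LDAP.
ACCESS = {
    "analyst": {"sales_db": "read"},
    "admin": {"sales_db": "write", "hr_db": "write"},
}


def is_authorized(user, database, needed):
    """Return True if the user's granted level on the database covers the needed level."""
    granted = ACCESS.get(user, {}).get(database, "none")
    return LEVELS[granted] >= LEVELS[needed]


print(is_authorized("analyst", "sales_db", "read"))   # analyst may read sales_db
print(is_authorized("analyst", "sales_db", "write"))  # but may not write to it
print(is_authorized("analyst", "hr_db", "read"))      # and has no grant on hr_db
```

Defaulting to "none" for unknown users and databases keeps the check closed by default, which is the safer failure mode for a security layer.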

Pluggable Authentication Modules (PAM) help integrate authentication systems like Kerberos and LDAP (Lightweight Directory Access Protocol), which provides more security. The administrator can decide whether two passwords need to be entered or one is enough, and whether it should be a one-time password or no password at all.

Once the basic authentication and authorization are in place, a Secure Sockets Layer (SSL) can be set up. It secures the link between a web browser and the server. On top of this, there must be regular penetration testing to check how strong the security is. Security must be continuous and cannot be a one-time thing.

  1. Encryption: Encrypting the data limits exposure in the event of a breach
  2. Configuration management: Use best practices for Ansible, Chef, Salt or Puppet
  3. Communication: Log all communication to detect anomalies


The buck doesn’t stop at setting up the security layer. The next requisite action is governance of the security module. Once the users are decided and their credentials created, there needs to be an administrator who will govern the usage of the security layer. Adding users, resetting passwords, deleting users and periodic password renewal are some of the common governance tasks. Alongside these, there are other maintenance related tasks such as:

  1. Audit: Internal/external audits to check the data
  2. Strict policies and procedures need to be implemented
  3. Data standards need to be maintained
  4. There must be data checks and balances
  5. Information lifecycle management: Policies around archiving, retention and purging
  6. System Monitoring: Provision for monitoring systems like Datadog

Products like Apache Atlas and Apache Ranger provide data governance.



The entire article, till now, has been crafted out of first-hand experiences from an Affine perspective. More and more analytics companies are adopting newer technology and delving deep into the realm of big data analytics (or at least claiming to do so). To have even a bare minimum standing in the world of big data, or the world of serious analytics and data science, in-depth knowledge of these technologies is indispensable. Affine, in this regard, has leveraged some of these technologies and managed to build several successful automated large scale models, most of which culminate in a UI for the user. We, at Affine, have managed to marry the skills required for a robust backend with those for a lucid and attractive front-end, to produce some high-impact web tools and models.

Starting from forecasting tools, inventory management to price optimization, we at Affine have managed to procure multiple Fortune 500 “happy” clients when it comes to solving day to day repetitive modeling requirements. Most of the technologies and programming languages mentioned have been tried and implemented across various projects, for domains ranging from retail to automobile industries. We have built a reputation as the “one-stop-shop” for analytic, consulting and UI competencies, around large scale models, involving large scale data.


Shuddhashil Mullick

Senior Delivery Manager at Affine Analytics
