Even though processing and storage have become cheap and enterprises are adopting high-performance analytics infrastructure, in the majority of cases analytics work is still constrained by the local system. Think of a proof of concept in a small enterprise that can’t afford to invest in heavy analytics infrastructure, academic projects, hobby projects… the list goes on, and the common factor is system constraints.
Though we can’t escape every session of staring endlessly at a blinking cursor, we can definitely make such sessions rarer. A faster end result, or a quicker turnaround when testing sample cases, makes a real difference and leaves far less frustration behind each code-execution session. The two primary causes of time-intensive computation are data size and algorithm complexity. In this article, we look at some effective measures to reduce the data being processed, along with a few related coding habits, without compromising the output.
- Ingest and hold limited data: data size is one of the crucial factors limiting how quickly you get output.
  - Avoid ‘Select *’ – a representative calculation: a million rows with just 5 columns can already amount to roughly 80 MB of data in memory. By current standards we routinely work on data sets with tens of millions of rows, often with many more columns, and on top of that come processing complexity and the multiple intermediate variables that each hold some subset of the data. To avoid this so-called ‘endless processing’, import only the necessary columns rather than selecting everything in the data table.
  - Drop the unnecessary – drop columns you have already finished processing and will never use again.
  - Use a ‘where’ clause – rather than working on the whole blob of data and then running loops over subsets, it is often better to ingest only the data for that one city or one product category. This greatly boosts processing speed: every step handles a smaller quantum of data, and the algorithm doesn’t waste effort on patterns you don’t care about.
  - Treat frequent and long-tail data separately – if there is a clear and wide gap in the occurrence frequency of certain data points (e.g. sub-categories), it is often better to identify the long-tail data and process it separately from the frequently occurring set. A typical scenario is a recommendation engine based on Market Basket Analysis, where the support and confidence thresholds drive the processing time: a large support value makes the run fast but can miss long-tail items, while a very small one – at times even 0.001 – blows up both the processing time and the number of output rules. It is better to mine rules among the long-tail items separately.
  - Remove variables – regularly remove unnecessary and copied variables with rm() to free memory.
  - Process data in smaller chunks so that memory is not tied up for long stretches. A sketch pulling several of these ideas together follows this list.
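A minimal sketch of several of the tips above, using data.table. The file name sales.csv, the column names (city, category, amount) and the city filter are illustrative assumptions, not taken from the article:

```r
library(data.table)

# Avoid 'Select *': ingest only the columns that are actually needed
sales <- fread("sales.csv", select = c("city", "category", "amount"))

# 'Where'-style filtering at ingestion time: keep just the one city of interest
mumbai <- sales[city == "Mumbai"]

# Drop a column once it has served its purpose
mumbai[, category := NULL]

# Remove the full table (and any copies) to free memory
rm(sales)
gc()
```

If even one full pass over the file is too heavy, the data can be streamed in smaller chunks; one option is readr::read_csv_chunked, shown here with a hypothetical per-chunk aggregation:

```r
library(readr)

# Aggregate amount by city 100,000 rows at a time instead of loading everything
city_totals <- read_csv_chunked(
  "sales.csv",
  DataFrameCallback$new(function(chunk, pos) {
    aggregate(amount ~ city, data = chunk, FUN = sum)
  }),
  chunk_size = 100000
)
# city_totals holds per-chunk partial sums; a final aggregate() over it gives
# the overall totals per city
head(city_totals)
```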
- Loops: the overlooked monster that can plague a system with all manner of inefficiencies.
  - Break/jump loop steps – always add logic to break out of the loop (break) or jump to the next iteration (next), so that when a necessary condition is not satisfied the processing is terminated early instead of wasting precious time and processing power.
  - Loop vs apply() – R’s vectorised built-in functions and the apply() family are usually much faster than explicit loops. Instead of looping over hand-built subsets to perform a transformation, aggregation or calculation, it is generally advisable to use apply()/which(), or to work with a data.table object rather than a data.frame.
  - Parallel processing – unless told otherwise, R uses only one core of your machine by default. Packages such as doParallel and foreach let you explicitly register more cores and run loop iterations across multiple parallel workers.
  - Shoot down time-consuming loop/code activity – track the key values. Once a loop starts it becomes very difficult to see what is happening at each step; use print() to emit the relevant in-loop values so you can tell at which value the loop is slowing down.
  - Avoid IF inside loops – wherever possible, evaluate the loop’s condition outside the loop. For example, compute ConditionOfLoop = (data$col1 + data$col2) > threshold once, take the row numbers where it is TRUE (e.g. with which()), and only then loop over those rows. A fuller sketch follows this list.
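The loop-related tips can be combined as in the sketch below. The orders data frame, its column names and the thresholds are illustrative; the parallel part assumes the doParallel package (which also attaches foreach and parallel) is installed:

```r
library(doParallel)   # attaches foreach and parallel as well

set.seed(1)
orders <- data.frame(col1 = runif(1e4), col2 = runif(1e4))
threshold <- 1.5

# Evaluate the condition once, outside the loop, and keep only rows that need work
rows_to_process <- which((orders$col1 + orders$col2) > threshold)

# break / next / print: skip unusable rows, trace progress, stop early when done
running_total <- 0
for (i in rows_to_process) {
  if (is.na(orders$col1[i])) next            # jump to the next iteration
  running_total <- running_total + orders$col1[i]
  if (running_total > 100) {
    print(paste("Stopping early at row", i)) # shows where the loop has reached
    break                                    # terminate the loop
  }
}

# apply-family alternative to a row-by-row loop
row_scores <- vapply(rows_to_process,
                     function(i) orders$col1[i] * orders$col2[i],
                     numeric(1))

# Parallel version: register extra cores explicitly, then let foreach fan the
# iterations out across the workers
cl <- makeCluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
row_scores_par <- foreach(i = rows_to_process, .combine = c) %dopar% {
  orders$col1[i] * orders$col2[i]
}
stopCluster(cl)
```

For work as light as this multiplication the parallel overhead outweighs the gain; the foreach pattern pays off when each iteration does substantial computation.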
- Convert the format: at times converting the format of the dataset in question can work wonders for processing speed. For example, if Market Basket Analysis produces millions of rules, it is better to convert the rules into a data frame (with longer rules stored as comma-separated strings) and then do the processing, as sketched below.
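As an illustration with the arules package: a large set of association rules can be flattened into a plain data frame and then handled with ordinary data-frame tools. The toy transactions below are a hypothetical stand-in for real basket data:

```r
library(arules)

# Toy transactions built from a list of item baskets (illustrative data only)
txns <- as(list(c("bread", "butter"),
                c("bread", "milk", "butter"),
                c("milk", "eggs")),
           "transactions")

# Mine rules; the support/confidence thresholds control run time and rule count
rules <- apriori(txns, parameter = list(supp = 0.3, conf = 0.6))

# Flatten the S4 rules object into a data frame: each rule becomes a single
# comma-separated string alongside its quality measures (support, confidence, lift)
rules_df <- as(rules, "data.frame")
head(rules_df)
```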