Here I offer a few interesting, off-the-beaten-path problems that you won’t find in textbooks, data science camps, or college classes. These problems range from applied mathematics to statistics and computer science, and are aimed at getting the novice interested in a few core subjects that most data scientists master. The problems are described in simple English and don’t require math / stats / probability knowledge beyond high school level. My goal is to attract people interested in data science who are somewhat put off by the depth and volume of (in my opinion) unnecessary mathematics included in many curricula. I believe that successful data science can be engineered and deployed by scientists coming from other disciplines who do not necessarily have a deep analytical background yet are familiar with data. While I have a strong mathematical background myself, I have developed an entire data science and machine learning framework (mostly for data science automation), known as deep data science, that is almost free of mathematics. It relies first and foremost on data, metrics, and algorithm architecture.
The challenges or exercises below are designed in the same spirit: light on mathematics, yet deep, useful or fun, interesting (as opposed to boring), and genuinely challenging. I also follow a bottom-up approach, where questions and conjectures come first, followed by data collection / data generation and simulations, then by developing hypotheses based on observations, testing them, and finally discovering concepts and principles. It is the opposite of the top-down approach taught in many textbooks and classes, where theory comes first, out of nowhere, followed by practice and applications (if at all). One of the reasons I favor my approach is that I’ve found that the traditional teaching methodology turns most students – including myself when I was a kid – off mathematics. Furthermore, I would hope that my way of presenting mathematics / statistics (and data science in general) will appeal to women and minorities.
So, let’s start with the fun! You will see that you can learn serious statistical concepts (including limit theorems) without knowing mathematics, much less probabilities or random variables. In fact, without even realizing that you are discovering these advanced analytical concepts. Numerous other challenges can be found here.
1. Simulations, percentiles and Bates distribution
In this challenge (indeed, more like a tutorial), we ask you first to generate n random vectors (x, y) uniformly distributed on [0, 1] x [0, 1]. In many packages, this is accomplished using the function RAND(), and we provide source code below in this section. If you don’t know how to code, you can start by reading this material.
Once you have generated the n bi-variate deviates (x, y), compute the average (x+y)/2 for each generated vector. Then sort these n averages, as a first step towards computing the (empirical) percentiles. The percentile function F(p), for 0 <= p <= 1, is defined as follows.
- Let r be the rank associated with the value (x+y)/2, once these values are sorted. Thus 1 <= r <= n.
- The normalized rank is then defined as s = r / n, and of course, 0 <= s <= 1.
- Then F(p) is the value of (x+y)/2 whose standardized rank s is closest to p; in other words, F(p) is the empirical quantile of order p.
Obviously, we also have 0 <= F(p) <= 1. Indeed F(p) is an increasing function, with F(0) = 0 and F(1) = 1.
Problem
Plot F(p) for various values of n. Do you see a pattern? Does the function F converge to something when n becomes large enough? And what is the limiting distribution, assuming it converges? Below is one plot obtained on simulated data with n = 180,000, with F(p) computed for 100 values of p equally spaced between 0 and 1 (called percentiles.)
Solution
You need to:
- play with your tool (Tableau, R, Python, even Excel will do!) to generate the data and compute F for various p and n and
- possibly transform the data (try multiple transformations on F(p) — of course looking at the square of F(p) will be illuminating in this case) then
- fit the data with a number of pre-specified curves (called trend-lines in Excel) to
- discover what the limiting distribution is.
With Excel for instance, you will see that there is a perfect fit with the power trend-line when 0 <= p <= 1/2, provided n is large enough (n = 1,000 is good enough; very little change is obtained with n = 200,000.) With a bit of extra effort, you can see that the solution (limiting distribution) is the following (a short derivation sketch is provided after the two formulas):
- F(p) = SQRT( p / 2 ) if 0 <= p <=0.5
- F(p) = 1 - SQRT( (1-p) / 2 ) if 0.5 <= p <= 1
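For readers curious where these two formulas come from, here is a short derivation sketch (my own addition, using standard probability; it is not part of the original exercise). For two independent uniform deviates X and Y and 0 <= t <= 1/2, the set {x + y <= 2t} is a triangle of area (2t)^2 / 2, so

P\left( \frac{X+Y}{2} \le t \right) = P( X + Y \le 2t ) = \frac{(2t)^2}{2} = 2t^2, \qquad 0 \le t \le \tfrac{1}{2},

and solving p = 2t^2 for t yields F(p) = \sqrt{p/2}. The branch for p above 1/2 follows from the symmetry t \mapsto 1 - t of the triangular density of (X+Y)/2.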
Below is the source code (written in Perl) that I used to generate the random deviates, compute the averages, sort them, and compute F(p) for 100 values of p and various values of n.
$seed=100;
srand($seed);                        # fix the seed for reproducibility

for ($n=1000; $n<200001; $n+=30000) {
  %h=();                             # hash table storing the n averages
  for ($iter=0; $iter<$n; $iter++) {
    $x=rand();
    $y=rand();
    $h{$iter}=($x+$y)/2;             # average of the two uniform deviates
  }
  my @keys = sort { $h{$a} <=> $h{$b} } keys(%h);   # sort by numerical value
  my @vals = @h{@keys};              # averages in increasing order
  open(OUT, ">> dist.txt") or die "Cannot open dist.txt: $!";
  $step=int($n/100);                 # one output line per percentile
  for ($iter=0; $iter<$n; $iter++) {
    if ($iter % $step == 0) {
      $p=$iter/$step;                # percentile index: 0, 1, ..., 99
      print OUT "$n\t$p\t$vals[$iter]\n";   # print in the sort order
    }
  }
  close(OUT);
}
The code produces a data set called dist.txt which was used as input to Excel, to produce the above figure.
What you have learned here
You may not realize this, but you have somehow mastered the equivalent of 300 pages’ worth of college textbooks (and beyond) by performing this little exercise. Among other things, you
- Performed Monte-Carlo simulations
- Computed empirical statistics associated with random variables
- Performed model-fitting
- Played with the Bates distribution (not found in the average textbook)
- Played with the concept of statistical distribution (here, the inverse of F(p) is the cumulative distribution function of the Bates distribution)
- Implicitly navigated the complex field of statistical limit theorems
- Used random number generators (not described in stats 101 textbooks)
- Did some real computations on a computer with a programming language
A next step consists of doing the same exercise with 3 variables (x, y, z), then with k variables, and looking at the limit as k tends to infinity; a sketch of the k-variable version is shown below. Or try SQRT( x * y ) instead of (x + y)/2. Or try deviates that are different from uniform on [0, 1].
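Here is a minimal sketch of the k-variable extension (my own code, not part of the original article; the values of k and n and the output file name avg_k.txt are arbitrary choices):

$k=10;                                      # number of uniform deviates per average (arbitrary)
$n=100000;                                  # sample size (arbitrary)
srand(100);
%h=();
for ($iter=0; $iter<$n; $iter++) {
  $sum=0;
  for ($j=0; $j<$k; $j++) { $sum+=rand(); }
  $h{$iter}=$sum/$k;                        # average of k uniform deviates
}
my @vals = sort { $a <=> $b } values(%h);   # sorted averages
open(OUT, "> avg_k.txt") or die $!;
$step=int($n/100);
for ($p=0; $p<100; $p++) {
  print OUT "$k\t$p\t$vals[$p*$step]\n";    # empirical percentile of order p
}
close(OUT);

As k grows, the plotted percentiles should increasingly resemble the quantiles of a bell curve, which is precisely the phenomenon that the limit theorems mentioned above formalize.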
Finally, the above plot can be used to test the quality of your random number generator. If F(p) does not converge as specified and all your computations are correct, then maybe your random number generator is faulty.
2. Computational complexity conjecture
Computational complexity is an important topic in computer science. It is the art and science of measuring how fast or how slow an algorithm is, either on average or for the worst case scenario. For simplicity, let’s say that an algorithm processes a data set with n data points, and that it takes f(n) elementary operations (up to a multiplicative constant) to process the n data points as n becomes very large. In O notation, we say that the complexity of the algorithm is O( f(n)). Typically f(n) can be a power or exponential or logarithm or any appropriate function, see below.
Many algorithms are now optimized, servers are cheap, and a slow algorithm implemented on a distributed architecture can outperform efficient algorithms that are difficult to parallelize. Sometimes the bottleneck is in the data transfer more than in the algorithmic architecture. Still, for algorithms processing large volumes of data in near real-time, computational complexity remains very important: read my article about how inefficient many modern algorithms are and how they could benefit from a face-lift, with faster processing times making it possible to take into account more metrics, more data, and more complicated metrics, to provide better results. Even for data transfers, encryption algorithms benefit from increased speed and increased security, and thus from being revamped accordingly. Finally, it is not true that all potential algorithms have been invented and optimized already. New algorithms are regularly invented, for instance automated tagging to perform very fast clustering on large unstructured data sets, and old ones may require changes to adapt to new environments (Hadoop, HPC, quantum computers, and so on), though sometimes the architecture adapts to the algorithm itself (graph databases). In short, computational complexity is not dead — yet.
Computational complexity can be improved using pre-processing techniques, or using vast amounts of memory (RAM); a toy example is shown below.
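As a simple illustration of this trade-off (my own toy example, not from the original article): looking up a value in an unsorted list requires scanning it, O(n) per query, while pre-processing the list into a hash table (at the cost of extra RAM) makes each subsequent lookup O(1) on average.

# Toy example: pre-processing trades memory for speed.
my @data = map { int(rand(1_000_000)) } (1..100_000);    # unsorted list

# Without pre-processing: each membership query scans the whole list, O(n).
sub slow_lookup { my ($target) = @_; foreach my $v (@data) { return 1 if $v == $target; } return 0; }

# With pre-processing: build a hash index once, O(n), using O(n) extra memory...
my %index = map { $_ => 1 } @data;
# ...so that each subsequent query is O(1) on average.
sub fast_lookup { my ($target) = @_; return exists $index{$target} ? 1 : 0; }

print slow_lookup(123456), "\t", fast_lookup(123456), "\n";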
That said, our focus here is to disprove the following conjecture:
Conjecture
If a linear or super linear algorithm — that is, running at least as fast as O(n) — has complexity O(f(n)), then the function f must necessarily have the form
f(n) = n^p * (log n)^q * (log log n)^r * (log log log n)^s * …
where 0 <= p <= 1, and the leading coefficient (the first coefficient among p, q, r, s, … not equal to zero) is strictly positive. Algorithms running in O(1), that is, in constant time regardless of size (yes, they do exist!), are not considered here. If you allow the coefficient p to be larger than one, then the conjecture covers a much larger class of algorithms, including P-problems. Any problem not solvable with a computational complexity satisfying the above condition (with p possibly larger than one) lies outside P; for instance, factoring an integer is believed not to be solvable in polynomial time. See also the NP versus P dilemma.
By disproving the conjecture, we are looking here for “computational complexity” functions f that do not satisfy the above characterization. The conjecture was mentioned in the context of measuring the density of an infinite set of integers; click here for details. The first two subsections below provide attempts at disproving the conjecture, that is, at finding counter-examples.
Two special functions
Here I consider two rather unusual functions.
- f(n) = log(n + log(n + log(n + …)))
- g(n) = log(n + n log(n + n log(n + …)))
Are these two functions counter-examples to the conjecture? Before reading further, try to answer this question by yourself. This may involve computing many values of f(n) and g(n) and visually checking the asymptotic behavior. Note that for a specific n, the computation of f(n) and g(n) is extremely fast, gaining a few decimals of accuracy at each iteration. A short sketch of such a computation follows.
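Here is a minimal sketch of that computation (my own code, not from the original article); it evaluates the nested logarithms by iterating the defining recursions until the values stabilize:

# f(n) = log(n + log(n + log(n + ...))): iterate t <- log(n + t) until stable.
sub f_nested {
  my ($n) = @_;
  my $t = 0;
  for (1..100) { $t = log($n + $t); }      # converges in far fewer than 100 iterations
  return $t;
}

# g(n) = log(n + n log(n + n log(n + ...))): iterate t <- log(n + n*t).
sub g_nested {
  my ($n) = @_;
  my $t = 0;
  for (1..100) { $t = log($n + $n * $t); }
  return $t;
}

foreach my $n (10, 100, 1000, 10000, 100000) {
  printf "%8d\t%.6f\t%.6f\t%.6f\n", $n, f_nested($n), g_nested($n), log($n);
}

Comparing the first two computed columns with log n (the last column) makes the asymptotic behavior described below quite visible.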
The answer for f is rather easy: f satisfies f(n) = log(n + f(n)) and thus, clearly, f(n) = O(log n); indeed, f(n) / log n tends to 1 as n tends to infinity. So the conjecture is not violated. For g, we have g(n) = log(n + n g(n)), resulting in the same conclusion. Indeed, for large n, a very good approximation is g(n) = log n + log log n + ….
Can you find a function that would violate the conjecture?
Two even more special functions
Now let us consider two incredibly slow growing functions. Let’s start with f defined by the recursion
f(x) = 1 + f(log x), with f(1) = 0, f(e) = 1, f(e^e) = 2, f(e^{e^e}) = 3, and so on.
We can make f a continuous function by requiring that it is piecewise linear between 1 and e, between e and e^e, and so on.
It looks like f(n), as n tends to infinity, is infinitely smaller than log n, log(log n), log(log(log n)), and so on, no matter how many (finitely many) nested log’s you have. Yet f is not a constant either; indeed, as n tends to infinity, f(n) also tends to infinity. Is f a function that violates the above conjecture — a counter-example? If not, what about the even far more slowly growing function g defined as g(x) = 1 + g(f(x)), where f is the function that we have just discussed?
Note that an algorithm with a computational complexity as low as O(f(n)) or O(g(n)) would be incredibly fast; a small sketch for evaluating f numerically is shown below.
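Here is a rough sketch (my own code; the linear interpolation of the final step is handled crudely and is only meant to illustrate how slowly f grows). It evaluates f by counting how many times the logarithm must be applied before the value drops to e or below:

# f(x) = 1 + f(log x), with f(1) = 0: count nested logs, interpolating the final step linearly.
sub slow_f {
  my ($x) = @_;                    # assumes x > 1
  my $count = 0;
  while ($x > exp(1)) {            # apply full log steps while x > e
    $x = log($x);
    $count++;
  }
  # now 1 < x <= e: interpolate linearly between f(1) = 0 and f(e) = 1
  return $count + ($x - 1) / (exp(1) - 1);
}

foreach my $x (10, 1e3, 1e6, 1e12, 1e100) {
  printf "%g\t%.4f\n", $x, slow_f($x);
}

Even for x = 1e100, the returned value stays below 4, giving a concrete feel for how slowly f grows.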
Algorithms, big data, and model fitting in number theory
Modern number theory extensively deals with big data, such as billions of decimals of Pi or billions of prime numbers, and with how to generate them efficiently. It relies on algorithms, some implemented in a distributed environment, not only to discover or validate new results, but also for practical business or security problems such as data encryption, random number generation, reverse engineering (hacking), or fraud detection. Many unproved, fascinating results, as well as a cool algorithm with known computational complexity, can be found here. This is still an area where intellectual challenges, easy to state in simple English but difficult to solve, are plentiful, especially if you read the article in question and are ready to spend a lot of time designing and testing efficient algorithms, and generating vast amounts of data for testing or discovery purposes.
3. Fractional calculus
Here are two fun problems, also dealing with unusual functions or operators as in the previous section.
Half derivatives
Let us denote as DF the usual derivative operator, that is, DF(f(x)) = df(x) / dx = f'(x). The half derivative is an operator DH such that DH(DH(f(x))) = DF(f(x)), so that DH is sometimes denoted as DF^{1/2} or SQRT(DF). In algebraic notation, we have DH o DH = DF. For a function f that can be represented by a series f(x) = SUM{ a(k) x^k }, where the sum is over all positive integers k, one can define DH as a linear operator such that DH(x^k) = c(k) * x^{k-1/2}. Since DH o DH = DF, we must have
c(k) * DH(x^{k-1/2}) = c(k) * c(k-1/2) * x^{k-1} = DF(x^k) = k * x^{k-1}, that is,
c(k) * c(k-1/2) = k.
The previous recursion on the c(k) coefficients eventually defines DH up to a multiplicative constant.
Exercise: compute DH explicitly (general formula) and provide examples of functions f(x) where DH(f(x)) can be computed in closed form.
Solution: an elegant solution can be found here. It has a simple expression:

D^alpha( x^k ) = [ Gamma(k+1) / Gamma(k+1-alpha) ] * x^{k-alpha},

where alpha = 1/2 corresponds to DH and Gamma denotes the gamma function. In particular, c(k) = Gamma(k+1) / Gamma(k+1/2), and one can check that c(k) * c(k-1/2) = Gamma(k+1) / Gamma(k) = k, as required.
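As a quick sanity check (my own worked example, not taken from the original solution), apply the formula to f(x) = x:

D^{1/2}(x) = \frac{\Gamma(2)}{\Gamma(3/2)}\, x^{1/2} = \frac{2\sqrt{x}}{\sqrt{\pi}}, \qquad
D^{1/2}\!\left( \frac{2\sqrt{x}}{\sqrt{\pi}} \right) = \frac{2}{\sqrt{\pi}} \cdot \frac{\Gamma(3/2)}{\Gamma(1)}\, x^{0} = 1 = DF(x),

so applying the half derivative twice indeed reproduces the ordinary derivative of x.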
Another square root operator, and the fixed point algorithm
How would you solve the equation g(g(x)) = f(x), where f is known and g is the unknown function? In algebraic notation, it is written as g o g = f, or g = f^{1/2}. It presents some similarities with the previous fractional calculus problem. What is the solution if f(x) = log(x)? Would it be a function that violates the conjecture in section 2?
I do not provide an answer here, but I suggest the following iterative algorithm, based on the fixed-point theorem, to solve this kind of problem.
- Start with g(x; 0) = h(x), a function yet to be determined to guarantee convergence, maybe h(x) = f(x) or h(x) = 0 or h(x) = x.
- Iteratively compute g(x; n+1) = f(x) – g(g(x; n); n) + g(x; n) for n = 0, 1, 2, …
If the scheme converges, then g(x) is the limit of g(x; n) as n tends to infinity.
One final word: it is better to code this algorithm without using recursion. Recursive implementations (as offered by programming languages) tend to eat a lot of memory, are very slow, and are not easily adapted to distributed environments. Instead, at each iteration, use arrays to store the values of g(x; n) and g(x; n+1) for 20,000 values of x equally spaced between -10 and +10, and update these arrays accordingly. These arrays will be sparse until n gets really big, and that’s OK. For instance, when at some point in the iterative process you need the previous value computed at an argument equal to (say) x = 1.225193, use instead the value available in memory for the closest argument; in this case, it will be x = 1.230000 when n is large enough. A sketch of this array-based implementation follows.
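Below is a minimal sketch of that array-based scheme (my own code, not the author’s; the grid size, the number of iterations, and the test function f(x) = x/2 + 1 are arbitrary illustration choices). This particular f is convenient because its functional square root is known exactly, g(x) = x / SQRT(2) + 2 - SQRT(2), which makes the output easy to check.

# Array-based fixed-point scheme for g o g = f, illustrated with f(x) = x/2 + 1.
my $N    = 2001;                            # grid points (coarser than the 20,000 suggested above)
my $xmin = -10;
my $xmax = 10;
my $dx   = ($xmax - $xmin) / ($N - 1);

sub f { my ($x) = @_; return $x/2 + 1; }

my @g = map { $xmin + $_ * $dx } (0 .. $N-1);   # g(x; 0) = h(x) = x, stored on the grid

sub g_at {                                  # nearest-grid lookup of g at an arbitrary argument
  my ($gref, $x) = @_;
  my $i = int( ($x - $xmin) / $dx + 0.5 );
  $i = 0      if $i < 0;                    # clamp arguments falling outside the grid
  $i = $N - 1 if $i > $N - 1;
  return $gref->[$i];
}

for my $n (1 .. 30) {                       # g(x; n+1) = f(x) - g(g(x; n); n) + g(x; n)
  my @new;
  for my $i (0 .. $N-1) {
    my $x = $xmin + $i * $dx;
    $new[$i] = f($x) - g_at(\@g, $g[$i]) + $g[$i];
  }
  @g = @new;
}

foreach my $x (-5, 0, 2.5) {                # compare with the exact solution x/sqrt(2) + 2 - sqrt(2)
  printf "%.2f\t%.4f\t%.4f\n", $x, g_at(\@g, $x), $x/sqrt(2) + 2 - sqrt(2);
}

For this choice of f, the iterates settle near the exact affine solution (up to the grid resolution). For f(x) = log(x), the same scheme would need a positive grid, and convergence is not guaranteed in general.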