This article has two parts:
- A list of the top 20 experts, shown in reverse rank order, with each expert's Twitter handle, number of Twitter followers, and Klout score. We hope to soon see a woman among the top 10. The top woman is currently #11.
- A discussion of a robust methodology for scoring experts
1. The top 20
This is a subset of a bigger list published here. Note that our data scientist is ranked #6.
Rank | Name | Twitter Handle | Twitter Followers | Klout Score |
---|---|---|---|---|
20 | Bernard Marr | BernardMarr | 86K | 66.5 |
19 | Jeremy Waite | jeremywaite | 93K | 67.5 |
18 | R Ray Wang | wang0 | 80K | 67.6 |
17 | Hadley Wickham | hadleywickham | 23K | 68 |
16 | Mike Briercliffe | mikejulietbravo | 54K | 68.5 |
15 | Evan Sinar | EvanSinar | 29K | 68.6 |
14 | Bob E. Hayes | bobehayes | 5K | 68.65 |
13 | Dez Blanchfield | dez_blanchfield | 77K | 68.7 |
12 | Andrew Ng | andrewng | 48K | 69.5 |
11 | Hilary Mason | hmason | 68K | 70 |
10 | Gregory Piatetsky | kdnuggets | 48K | 70.35 |
9 | Ronald van Loon | Ronald_vanLoon | 29K | 71.5 |
8 | Hans Rosling | HansRosling | 296K | 72.05 |
7 | Randy Olson | randal_olson | 80K | 73 |
6 | Vincent Granville | analyticbridge | 128K | 73.5 |
5 | Timothy Hughes | Timothy_Hughes | 134K | 73.6 |
4 | Kirk Borne | kirkdborne | 58K | 74 |
3 | Vala Afshar | ValaAfshar | 101K | 78.5 |
2 | Simon Porter | simonlporter | 66K | 80.5 |
1 | Nate Silver | NateSilver538 | 1328K | 81 |
2. Proposed Algorithm to Score Experts
Scores can measure many things: popularity, how influential someone is in a specific domain, and so on. We have created various lists over the past few years, typically with a goal different from that of journalists: rewarding expertise and the volume of quality publications and references over traditional popularity metrics. We have built several lists of top data science / big data experts.
You should check these three lists and the associated literature, not just out of curiosity, but to discover the methodology used in each case: a methodology designed by a real data scientist, not a black-box tool used by a journalist. Thus our lists are robust, sound and unbiased – or at least the bias is known and disclosed.
Since we have seen lists in the past where the #1 expert was irrelevant, here we propose a three-step methodology to build lists and compute scores:
Step #1: Categorize sub-domains (of big data, data science, etc.)
Break the domain into sub-domains. For instance, we established a while back that
Data Science = 0.24 * Data Mining + 0.15 * Machine Learning + 0.14 * Analytics + 0.11 * Big Data
Read this paper to learn about the methodology used to arrive at this equation. Note that weights and even sub-domains evolve over time. And these sub-domains overlap, though that’s not difficult to handle.
Step #2: Categorize experts, and score by sub-domains
Start with a large list of experts, and make sure you are not missing any major ones (I have seen lists that were missing the number one expert).
Then categorize these experts according to pre-selected sub-domains (big data, machine learning, and so on in this case). This is performed by
- scraping tons of tweets or blog posts from these experts (or better, from high-score people talking about these experts),
- creating keyword frequency tables,
- extracting (for each expert) keywords associated with the sub-domains,
- and eventually clustering these experts by sub-domains.
This is done using an indexation algorithm. We have used an indexation (or automated tagging) algorithm in a very similar context, to assign sub-categories to 2,500 data science blogs. The methodology is explained in detail here. If the data is well structured, you can proceed as here: we were able to determine that Gregory Piatetsky-Shapiro and Vincent Granville belong to the same cluster, while Kirk Borne and Monica Rogati belong to another, machine-learning-heavy cluster.
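To make the keyword-table idea concrete, here is a minimal Python sketch. It is not the indexation algorithm referenced above: it simply counts keyword hits per sub-domain and labels each expert with the sub-domain that gets the most hits, as a stand-in for proper clustering. The keyword lists, function names, and sample texts are all hypothetical.

```python
from collections import Counter
import re

# Hypothetical keyword lists per sub-domain (assumption: not from the article).
SUBDOMAIN_KEYWORDS = {
    "machine learning": {"classifier", "neural", "training", "model"},
    "big data": {"hadoop", "spark", "cluster", "petabyte"},
}

def keyword_frequencies(texts):
    """Build a keyword frequency table from one expert's tweets or posts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

def subdomain_profile(texts, subdomain_keywords=SUBDOMAIN_KEYWORDS):
    """Share of keyword hits falling in each sub-domain (0.0 if no hits)."""
    counts = keyword_frequencies(texts)
    hits = {
        name: sum(counts[word] for word in words)
        for name, words in subdomain_keywords.items()
    }
    total = sum(hits.values())
    return {name: (n / total if total else 0.0) for name, n in hits.items()}

def dominant_subdomain(texts, subdomain_keywords=SUBDOMAIN_KEYWORDS):
    """Sub-domain with the most keyword hits, or None if nothing matches."""
    profile = subdomain_profile(texts, subdomain_keywords)
    best = max(profile, key=profile.get)
    return best if profile[best] > 0 else None
```

For example, `dominant_subdomain(["Training a neural model on Spark"])` would label that expert as machine learning, since three of the four keyword hits fall in that list. In practice the profiles (rather than a single label) could be fed to a real clustering method.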
Note: Klout scores (actually ranks) are also available at the sub-domain level, click here for details.
Step #3: Blend scores across sub-domains
Blend the scores obtained at the sub-domain level (in step #2) using the blending formula obtained in step #1.
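As an illustration, the blend can be sketched in a few lines of Python. The weights are those of the equation in step #1; the function name, the handling of missing sub-domains, and the optional normalization by the sum of weights are assumptions made for this sketch, not part of the original methodology.

```python
# Weights taken from the step #1 equation.
WEIGHTS = {
    "data mining": 0.24,
    "machine learning": 0.15,
    "analytics": 0.14,
    "big data": 0.11,
}

def blend_scores(subdomain_scores, weights=WEIGHTS, normalize=True):
    """Weighted blend of one expert's sub-domain scores.

    Sub-domains with no score are skipped (assumption). With normalize=True
    (also an assumption), the weighted sum is divided by the sum of the
    weights actually used, so the result stays on the input score scale.
    """
    used = {name: w for name, w in weights.items() if name in subdomain_scores}
    if not used:
        raise ValueError("no sub-domain scores match the weights")
    total = sum(w * subdomain_scores[name] for name, w in used.items())
    return total / sum(used.values()) if normalize else total
```

Normalizing is a design choice: the four published weights sum to 0.64, so an unnormalized blend of scores on a 0-100 scale would top out at 64.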
Caveat: Experts who do not tweet or publish much might not have statistically significant sub-domain scores. This can be handled by computing an aggregated score across sub-domains and ignoring the sub-domain scores. Statistical significance, at the score level, can be computed using the following method.