Monday, October 6, 2014

Big Data: The Myth debunked

Nearly everyone is talking about Big Data, and what used to be a Silicon Valley crusade has become an international quest to find value within corporations' data (through what is called Data Science). The truth is that we are still very far from being there.

1. The most widely used software does not scale very well

The lingua franca of Data Science is really R, with an impressive 5,873 packages (you can find pretty much anything: web scraping, Natural Language Processing, visualization, etc.). People like its ease of use and its natural syntax for non-programmers, as the exponential growth in the number of packages shows.

R is fabulous, but it was by no means created for large-scale processing. As a matter of fact, all the data needs to fit in memory, which severely limits its use for large-scale applications.

Its biggest contender, the up-and-coming Python, which has received great scientific contributions in recent years (Scikit, NumPy, SciPy), has a similar limitation: these libraries also work on data held in memory, and what's called the Global Interpreter Lock further restricts multi-threaded parallelism.
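To make the memory constraint concrete, here is a minimal sketch of the usual workaround: streaming the data instead of loading it all at once. The function name, column name, and sample data are hypothetical, and it only covers a simple aggregate, which is exactly why this approach does not generalize to most modeling tasks.

```python
import csv
import io

def streaming_mean(lines, column):
    """Compute the mean of one CSV column while holding only one row in memory."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:  # rows are read lazily, one at a time
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no rows to average")
    return total / count

# Hypothetical data; a real use would pass an open file handle instead.
sample = io.StringIO("id,amount\n1,10.0\n2,20.0\n3,60.0\n")
print(streaming_mean(sample, "amount"))  # prints 30.0
```

A mean can be computed in one pass like this, but many models need repeated passes or random access to the data, which is where the in-memory design starts to hurt.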

2. The Nascent Technologies have restricted uses

Large Scale Processing has 2 very different technologies in contention:

  • MPP (Massively Parallel Processing) databases
  • Hadoop / Map Reduce Eco-system


MPPs originated from the now very mature SQL technology and added a layer on top of it to speed up processing for Analytics, mostly via:
1- Columnar Storage: Each column in a database is stored individually to improve performance on aggregations [conducive to Analytics]
2- Compression: An algorithm identifies patterns in the data to determine the optimal lossless encoding method, saving a tremendous amount of disk space and improving data transfer speed across a cluster of servers
3- Parallel Operations: Because most SQL aggregations can be computed piece by piece, it's usually easy to run the calculations on each partition of the data and combine the results at the end
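The three ideas above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the table, the function names, and the choice of run-length encoding are mine, not what any particular MPP product does, and the partitions are summed one after another here rather than on separate servers.

```python
from itertools import groupby

# Hypothetical row-oriented table: (region, sales)
rows = [("east", 10), ("east", 20), ("east", 5), ("west", 7), ("west", 3)]

def to_columns(rows, names):
    """Pivot row tuples into one list per column (columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs; lossless."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Invert rle_encode to recover the original column."""
    return [value for value, length in pairs for _ in range(length)]

def partitioned_sum(values, n_parts):
    """Sum each partition independently, then combine the partial sums."""
    size = max(1, -(-len(values) // n_parts))  # ceiling division
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)

columns = to_columns(rows, ["region", "sales"])
encoded = rle_encode(columns["region"])
total = partitioned_sum(columns["sales"], 2)
```

Storing `region` as its own column is what makes the repeated values sit next to each other and compress well, and a sum is a good fit for partitioning because partial sums combine into the same total.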

MPPs are perfect for creating features that help a model identify patterns in the data. However, they are limited to simple math and are thus not adequate for advanced modeling.

Not surprisingly, the most advanced project (I am aware of) is MADlib, which runs on Pivotal products and is restricted to basic linear models.

HDFS / Hadoop:

Compared to MPPs, Hadoop provides a more flexible framework that can run Advanced Analytics at scale. Note that this was not the original intent (e.g. Google first used MapReduce, the model that Hadoop implements, to process unstructured search terms from very large query logs and create the Suggestion Tooltip, quite a simple operation after all), but it can be used to run more complex operations, as new projects have shown.
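For readers who have not seen the model, here is a minimal in-process sketch of a map, shuffle, and reduce pass that counts terms in a query log. The function names and sample queries are hypothetical, and nothing here uses Hadoop itself; a real job would spread the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_phase(query):
    """Emit a (term, 1) pair for every term in one logged query."""
    return [(term.lower(), 1) for term in query.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)

def word_count(queries):
    """Run map, shuffle, and reduce in sequence over a list of queries."""
    pairs = [pair for query in queries for pair in map_phase(query)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical query log
logs = ["big data", "Big Data myth", "data science"]
counts = word_count(logs)
```

Because each query is mapped independently and each key is reduced independently, both phases can be parallelized, which is what makes the same pattern usable for heavier workloads.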

One of the key tenets of Hadoop is to keep the data where it sits to avoid network congestion. As such, projects like Mahout (Machine Learning on HDFS) break the traditional barrier between the data storage infrastructure and the modeling environment, allowing both to live in one space. In that respect, Hadoop aims to be a comprehensive ecosystem.

That being said, this flourishing new world is still in its infancy and has not yet conquered the corporate world. Companies still scratch their heads about how to implement HDFS within their current infrastructure, and I have heard of notable failures, explained by the scarcity of qualified labor and a lack of clear objectives.


Here are my final observations:

1- Modeling activities are still mostly performed in R and Python, which are limited by data size, and there is no sign of that slowing down
2- However, MPPs are a cheap, convenient upgrade compared to traditional databases. They allow a good level of feature engineering without being incapacitated by the data size
3- The future of modeling seems to reside more in the Hadoop ecosystem than in MPPs, thanks to its more integrated environment and more flexible framework

1 comment:

  1. Great post. It certainly seems that Hadoop and other Map Reduce implementations will remain at the forefront of computing for some time to come, even as processing power increases for local machines.

    Regarding the limitations of R and Python at scale: many of the limitations are being overcome by talented developers in both communities.

    For R, Revolution Analytics has built an entire business around solving some of these problems. In addition, packages like dplyr, various SQL interfaces like RODBC and RSQLite, bigmemory and anything involving the "ff" class all aim to make on-disk storage and computation compatible with R's popular in-memory interface (see: http://cran.r-project.org/web/views/HighPerformanceComputing.html). There are also some interesting interfaces for parallel computing.

    Although I have less experience with Python, I've seen some very robust models built using Vowpal Wabbit (see: http://www.anlytcs.com/2014/01/an-easy-way-to-bridge-between-python.html). Given the open-source nature of Python (and R), it's reasonable to believe that as limitations are encountered, more and more solutions will become available (or can be built ad-hoc, if you're ambitious enough!)

    Finally, I think Julia deserves a mention somewhere here as the up-and-comer "goldilocks" language of data science (http://julialang.org). One of its founding principles is a new paradigm of easy-access distributed processing.