Data scientists are often considered wizards who deliver value from big data. These wizards need knowledge in three very distinct subject areas: scalable data management (e.g., data warehousing, Hadoop, parallel processing, query processing, SQL, and storage and resource management), data analysis (e.g., advanced statistics, linear algebra, optimization, machine learning, and mathematics), and domain expertise (e.g., engineering, logistics, medicine, or physics).
It is clear that tackling a data analysis problem requires both data analysis knowledge and domain expertise. However, data analysts who also possess skills in scalable data management technologies are hard to find. Naturally, such skills are also required if we are to put big data to good use.
The ever-growing Apache Stack, in conjunction with additional technologies and systems, makes it very hard for data scientists to stay on top of the field. The complexity of data analysis systems and programming is reminiscent of data management in the 1970s: a hodgepodge of different technologies, low-level programming, little interoperability, and data analysis implementations that are impossible to maintain.
Companies that, unlike Google, Twitter, and Facebook, do not have IT as their core business have a hard time attracting the talent needed to conduct big data analysis. What is missing is a revolutionary technology for big data, similar to what relational algebra offered to data management: a declarative specification of data analysis programs, so that analysts need only specify what kind of analysis they want to conduct and not how the analysis should be computed.
If data analyses were specified in a declarative language, data scientists would no longer have to worry about low-level programming. Instead, they would be free to concentrate on their data analysis problem. Rather than selecting the right algorithm in Mahout based on data properties or tuning algorithms to their particular computing infrastructure, the analyst could write a declarative program that runs on any architecture, test it with a data sample on their laptop, and run it on big data sets on a large cluster.
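To make the "what, not how" idea concrete, here is a minimal, purely illustrative sketch (not any real system; all names such as `Plan`, `run_serial`, and `run_chunked` are invented for this example). The analyst builds a logical plan describing the analysis; interchangeable engines then decide how to evaluate it, so the same program can run serially on a laptop sample or in partitioned fashion on a cluster-like backend.

```python
class Plan:
    """Logical description of an analysis; contains no execution strategy."""
    def __init__(self, steps=None):
        self.steps = steps or []

    def filter(self, pred):
        return Plan(self.steps + [("filter", pred)])

    def map(self, fn):
        return Plan(self.steps + [("map", fn)])

    def sum(self):
        return Plan(self.steps + [("sum", None)])


def run_serial(plan, data):
    """Naive engine: evaluate each step over the whole dataset in order."""
    out = list(data)
    for op, arg in plan.steps:
        if op == "filter":
            out = [x for x in out if arg(x)]
        elif op == "map":
            out = [arg(x) for x in out]
        elif op == "sum":
            out = sum(out)
    return out


def run_chunked(plan, data, chunk_size=2):
    """A second engine for the SAME plan: evaluate partitions independently
    and combine partial results, mimicking how a parallel backend would
    execute the identical logical program on a cluster."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    has_sum = any(op == "sum" for op, _ in plan.steps)
    partials = [run_serial(plan, c) for c in chunks]
    if has_sum:
        return sum(partials)            # sum distributes over partitions
    return [x for p in partials for x in p]


# The analyst states WHAT: sum of squares of the even inputs.
plan = Plan().filter(lambda x: x % 2 == 0).map(lambda x: x * x).sum()
```

Both engines produce the same result for the same plan (`run_serial(plan, [1, 2, 3, 4, 5])` and `run_chunked(plan, [1, 2, 3, 4, 5])` both yield 20); choosing between them is the system's job, not the analyst's.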
The system could automatically parallelize, optimize, and adapt the declarative data analysis program to a particular computer architecture, such as a compute cluster, a manycore in-memory system, or a single laptop. However, in order to do so, we will need the relational algebra equivalent for big data. SQL is not enough, given that big data applications often employ complex user-defined functions and iterations in order to compute, for instance, predictive models via regression, support vector machines, or other methods.
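The kind of computation the text says plain SQL cannot capture can be sketched in a few lines. Fitting even the simplest regression model by gradient descent requires a user-defined update rule and a data-dependent loop that repeats until convergence, control flow that a single standard SQL query does not express. The function below is a hypothetical illustration (fitting y ≈ w·x without an intercept, for brevity), not the method of any particular system.

```python
def fit_slope(points, lr=0.01, tol=1e-9, max_iter=10000):
    """Least-squares slope via gradient descent on 0.5 * sum((w*x - y)^2).

    The convergence-driven loop and the custom gradient computation are
    exactly the iteration + user-defined function pattern that goes beyond
    what a single declarative SQL query can state."""
    w = 0.0
    for _ in range(max_iter):
        # gradient of the squared-error loss with respect to w
        grad = sum((w * x - y) * x for x, y in points)
        w_new = w - lr * grad
        if abs(w_new - w) < tol:   # data-dependent stopping condition
            break
        w = w_new
    return w


pts = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
slope = fit_slope(pts)  # converges to sum(x*y) / sum(x*x) for this loss
```

A relational-algebra-style foundation for big data would have to treat such loops and user-defined functions as first-class, optimizable constructs rather than opaque escape hatches.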
In order to progress beyond the current state of the art, we will need a mathematical foundation and a declarative language abstract enough to facilitate reasoning about a data analysis program, so that a system can automatically optimize, parallelize, and adapt it to a particular computing infrastructure. If we succeed, we will unlock the big data market, much as relational algebra and SQL unlocked the multi-billion dollar business intelligence market in the 1990s.
Prof. Volker Markl, Database Systems & Information Management, TU Berlin