Data Scientists – The Sexiest Job of the 21st Century?

  • Posted on: 14 April 2014
  • By: Admin

Data scientists are often considered to be wizards who deliver value from big data. These wizards need to have knowledge in three very distinct subject areas, namely, scalable data management (e.g., data warehousing, Hadoop, parallel processing, query processing, SQL, and storage & resource management), data analysis (e.g., advanced statistics, linear algebra, optimization, machine learning, and mathematics), and domain area expertise (e.g., engineering, logistics, medicine, or physics).

It is clear that, in order to tackle a data analysis problem, data scientists must have both data analysis knowledge and domain area expertise. However, it is a challenge to find data analysts who also possess skills in scalable technologies. Naturally, this too is a requirement if we are to put big data to good use.

The ever-growing Apache Stack, in conjunction with additional technologies and systems, makes it very hard for data scientists to stay on top. The complexity of data analysis systems and programming is reminiscent of data management in the 1970s: a hodgepodge of different technologies, low-level programming, little interoperability, and data analysis implementations that are impossible to maintain.

Companies that, unlike Google, Twitter, and Facebook, do not have IT as their core business have a hard time attracting the talent needed to conduct big data analysis. What is missing is a revolutionary technology for big data, similar to what relational algebra offered data management: a declarative specification of data analysis programs, so that analysts need only specify what kind of analysis they want to conduct, not how the analysis should be computed.

If data analysis were specified using a declarative language, data scientists would not have to worry about low-level programming any longer. Instead, they would be free to concentrate on their data analysis problem. Rather than selecting the right algorithm in Mahout based on data properties or tuning algorithms based on their particular computing infrastructure, the analyst could write a declarative program that runs on any architecture, test it with a data sample on their laptop, and use it on big data sets on a large cluster.
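To make the "what, not how" distinction concrete, here is a minimal sketch using SQL via Python's built-in sqlite3 module, with hypothetical sensor data invented for illustration. The analyst declares only the desired result (an average per group); the engine decides the execution strategy (scan order, indexes, aggregation method). The same declarative query text could, in principle, run unchanged against a sample on a laptop or, with a different engine, against a large cluster.

```python
import sqlite3

# Hypothetical toy data set for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Declarative specification: *what* to compute, not *how*.
# The engine chooses the physical execution plan.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 10.0)]
```

The key point is that nothing in the query mentions partitioning, memory budgets, or parallelism; those decisions belong entirely to the system.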

The system could automatically parallelize, optimize, and adapt the declarative data analysis program to a particular computer architecture, such as a compute cluster, manycore in-memory system, or a single laptop. However, in order to do so, we will need the relational algebra equivalent for big data. SQL is not enough given that big data applications often employ complex user-defined functions and iterations in order to compute, for instance, predictive models via regression, support vector machines, or other methods.
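The kind of computation that plain SQL struggles to express is iteration until convergence with a user-defined update step. As a minimal sketch (with hypothetical data, fitting a one-variable linear model y ≈ w·x by gradient descent), the loop below is trivial in a general-purpose language but has no natural declarative counterpart in standard SQL, which is precisely why a relational-algebra equivalent for big data must treat iteration and user-defined functions as first-class constructs.

```python
# Hypothetical data generated by y = 2x; we recover w ≈ 2 iteratively.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.05
for _ in range(200):  # fixed iteration budget for the sketch
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 3))  # converges to 2.0
```

Regression, support vector machines, and similar predictive models all follow this iterate-and-refine shape, so an optimizer for declarative analysis programs must be able to reason about such loops, not just about joins and aggregations.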

In order to progress beyond the current state of the art, we will need a mathematical foundation and a declarative language that is abstract enough to facilitate reasoning about a data analysis system, in order to automatically optimize, parallelize, and adapt it to a particular computing infrastructure. If we are successful in achieving this, we would unlock the big data market, just as relational algebra and SQL unlocked the multi-billion-dollar business intelligence market in the 1990s.

Prof. Volker Markl, Database Systems & Information Management, TU Berlin

Data as the New Oil – Ecosystems and Political Meaning

  • Posted on: 4 April 2014
  • By: Admin

In this blog post, I will discuss some economic, societal, and legal issues concerning big data. One often hears that “data is the new oil.” Like oil, data is a complex product derived from numerous processing and refinement steps and an entire economic ecosystem involving drilling stations, refineries, and distribution networks, which include filling/gas stations. One can draw a similar analogy for the big data realm.

Data drilling stations are, for example, information extraction and integration methods, which extract and enrich semantics from crude data. The refineries are data analysis and mining algorithms, systems, and tools, which cluster, group, and characterize the data in a new way in order to derive insight and actionable information. We already see an entire economy of distribution networks emerging around big data, with information marketplaces that sell transformed, semantically enriched, and further augmented forms of data.

However, just as with oil, there is a huge political dimension. Data is a critical resource of the information economy. Countries with easy access to crude data that exercise control over important parts of the distribution networks will emerge as leaders. It is therefore in each country’s interest to invest in data hubs that provide citizens, enterprises, research institutions, and administrations with access to huge amounts of data and data analysis methods.

Data hubs are not to be confused with linked open data or the semantic web, as they differ both in intent and technological approach. Instead of linking data, connections in data hubs are established dynamically based on a particular analysis need (e.g., using data analysis methods drawn from relational algebra, statistics, signal processing, or machine learning).

Ethical standards, legal frameworks, international treaties, and agreements will be important not only for economic growth, but for society as a whole. In order to shape a global ecosystem and big data economy, close cooperation between governments and industry stakeholders will be required.

The Five Dimensions of Big Data

  • Posted on: 28 March 2014
  • By: Admin

Today, analysts seek to derive insight from large, heterogeneous, high-velocity (i.e., big) data sets using varying data analysis methods. These data sets are ubiquitous. They arise due to burgeoning cloud computing services, the anticipated Internet of Services (IoS), and the emerging Internet of Things (IoT). Big data is often defined as any data set that cannot be handled using today’s widely available mainstream solutions, techniques, and technologies.

Currently, there is a shift towards data-driven decision-making in both industry and the sciences. This trend was described in a big data study we recently conducted for the German BMWi (formerly the Federal Ministry of Economics and Technology, today known as the Federal Ministry for Economic Affairs and Energy). In the study, big data challenges and opportunities were classified into five dimensions, namely, technology, application, economic, legal, and social. These are described below.

Technology. There is a need for scalable systems and platforms for data analysis, novel data analysis methods, and in particular technologies to help overcome the skills gap (e.g., enabling data analysis methods to be accessible to a wider audience).

Application. Many novel applications are emerging in the information economy, such as information marketplaces, which refine and sell enriched data. These information marketplaces are effectively bootstrapping the information economy. Other examples include personalized medicine, Industry 4.0, and digital humanities.

Economic. The challenges and opportunities in the economic dimension lie in new business models and content delivery paradigm shifts (e.g., information pricing and the role of open-source software).

Legal. From a legal perspective, big data will present many challenges with respect to ownership, liability, and insolvency, in addition to prevalent issues, such as privacy and security.

Social. Lastly, data-driven innovation will have a profound impact on society as a whole with respect to social interaction, news, and democratic processes, among others.

The German study and an English summary can be found in the Big Data Management Report.