The next generation of data analysis requires the next generation of tools. The most popular open-source packages for data analysis (Python’s pandas and various R packages) are designed to work with small files of basic data types, but ‘small’ and ‘basic’ do not describe the data landscape of the future. The amount of data in the world is growing exponentially, and as The Economist observes, it’s changing as it grows:
“The quality of data has changed, too. They are no longer mainly stocks of digital information – databases of names and other well-defined personal data, such as age, sex, and income. The new economy is more about analyzing rapid realtime flows of often unstructured data: the streams of photos and videos generated by users of social networks, the reams of information produced by commuters on their way to work, the flood of data from hundreds of sensors in a jet engine.”
Current-generation tools therefore face a number of difficulties in analyzing the next generation of data. The first is scale: distributed computing systems like Hadoop and Spark can provide it, but at the cost of the ease of use that makes Python and R tools attractive.
Scaling an analysis also adds the cost of gluing together tools that may not support the same data types or operations (e.g., Spark DataFrame to pandas DataFrame to NumPy array to scikit-learn model). Another issue for current databases is storing nonstandard data types. A database can sometimes work around unsupported types (e.g., units and currencies) by attaching metadata to a field, but the same approach is harder to apply to more complicated data like images and video. The next-generation database should therefore offer the features that are lacking in the current generation:
- Scalability (works equally well on Small and Big Data)
- Ease of use (no need to glue together different formats)
- Flexibility (stores data types that may not exist yet).
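To make the flexibility point concrete, here is a minimal Python sketch of the metadata workaround mentioned above for currencies, next to a value type that carries its own currency. It is only an illustration, not how any particular database works: the `prices_cents` column, the `column_metadata` layout, and the `Money` class are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Workaround: store bare numbers and keep the unit in side metadata.
# The column name and metadata layout are hypothetical.
prices_cents = [1999, 500]
column_metadata = {"prices_cents": {"currency": "USD"}}

# Arithmetic on the raw values yields a plain int; the currency is not
# carried along and has to be re-attached by hand.
raw_total = sum(prices_cents)


@dataclass(frozen=True)
class Money:
    """A value that carries its own currency (illustrative type)."""
    cents: int
    currency: str

    def __add__(self, other: "Money") -> "Money":
        # Refuse to mix currencies instead of silently adding numbers.
        if self.currency != other.currency:
            raise ValueError("cannot add different currencies")
        return Money(self.cents + other.cents, self.currency)


typed_total = Money(1999, "USD") + Money(500, "USD")
```

With the side-metadata approach, the result of an operation is just a number and the caller has to remember what it means. With a first-class type, the meaning travels with the value, which is the kind of flexibility the list above asks of a database.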
JuliaDB aims to be the analytics database of the future. It is implemented entirely in Julia, a high-performance language for technical computing designed around modern technologies such as just-in-time compilation, type inference, and parallelism.