Monday, December 22, 2025

Datumbox Machine Learning Framework 0.7.0 Released


I’m really excited to announce that, after several months of development, the new version of Datumbox is out! The 0.7.0 version brings multi-threading support, fast disk-based training for datasets that don’t fit in memory, several algorithmic enhancements and a better architecture. Download it now from GitHub or the Maven Central Repository.

The focus of version 0.7.0 is to finally bring multi-threading support to the framework and make the disk-based training extremely fast. Moreover it brings several algorithmic enhancements in all the Regression-based algorithms, the Collaborative Filtering model and the N-grams extractor which is used in NLP applications. The architecture of the framework has been redesigned to separate the project into several modules (note that the artifactId of the main library is now datumbox-framework-lib) and to simplify its structure. Finally the new version brings several code improvements, better documentation in the form of javadocs and improved test coverage.

The 0.7.0 version of the framework is not backwards compatible with the 0.6.x branch. This is because major redevelopment was necessary to add the new features and to improve & simplify the architecture of the framework. Below I discuss the new features in detail:

Multi-threading support

The new framework is several times faster than the 0.6.x branch. This was achieved by using threads, by doing heavy profiling on the hot spots of the code and by rewriting core components to enable non-blocking concurrent reads/writes. Currently threads are used in all the algorithms that can be parallelized, which is the majority of the supported models of the framework. Parallel execution is supported both during training and during testing/predicting.

The project uses lots of Java 8 features in order to reduce the verbosity of the code, improve readability and modernize the code-base. Note that even though the framework makes heavy use of streams, all tasks are executed in their own ForkJoinPool to ensure that they will not get stuck. The level of parallelism is controlled either by programmatically changing the ConcurrencyConfiguration object or by configuring the datumbox.config.properties file.
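The dedicated-pool idea can be illustrated with a minimal Java 8 sketch. This is not the framework's own code: the class DedicatedPoolDemo and the method sumOfSquares are hypothetical names chosen for the illustration. It relies on the widely used (but not formally documented) behaviour that a parallel stream started from inside a ForkJoinPool task is executed by that pool instead of the shared common pool.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

// Hypothetical demo class; it is not part of the Datumbox API.
class DedicatedPoolDemo {

    // Sums the squares of 1..n with a parallel stream that runs inside
    // its own ForkJoinPool of the requested parallelism.
    static long sumOfSquares(int parallelism, long n) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The stream pipeline is submitted as a task, so its work is
            // picked up by this pool's workers and not by the common pool.
            return pool.submit(() ->
                    LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum()
            ).join();
        } finally {
            pool.shutdown(); // release the worker threads
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(4, 1000));
    }
}
```

Running every task in its own pool means that a slow or blocked task only ties up its own workers, and the parallelism of each task can be capped independently.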

Disk-based Training

Even though disk-based training (training models without loading the data in memory) has been possible since the 0.6.0 version, it was so slow that it made the feature practically unusable. In version 0.7.0, the Storage Engine mechanism was redeveloped to enable a hybrid approach of storing the hot/frequently accessed records in memory (in an LRU cache) while keeping the rest on disk. This approach makes the disk-based training very fast and it should be preferred even in cases where the data barely fit in memory (obviously if the data fit easily in RAM, the default in-memory training should be preferred). As in the previous version, the memory storage configuration can be changed programmatically by modifying the appropriate DatabaseConfiguration objects or by configuring the datumbox.config.properties file.

At this point I would like to point out that this feature would not have been possible without the amazing work done by Jan Kotek on MapDB. MapDB is an embedded Java database engine which provides concurrent Maps backed by disk storage and off-heap memory. Using his open-source library, I was able to develop a Storage Engine which enables Datumbox to handle several GB worth of training data on my laptop without loading them in memory.
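To make the hybrid hot/cold idea concrete, here is a minimal sketch of a bounded LRU cache sitting in front of a slower store. It does not use MapDB and it is not the Datumbox Storage Engine: the class HybridStore and all of its members are hypothetical, and a plain Map stands in for the disk-backed map. The sketch is also not thread-safe, unlike the concurrent maps described above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a small in-memory LRU cache in front of a slower store.
class HybridStore<K, V> {
    private final Map<K, V> backingStore; // stands in for the on-disk map
    private final Map<K, V> hotCache;     // bounded in-memory LRU cache
    private int backingReads = 0;         // counts reads that missed the cache

    HybridStore(Map<K, V> backingStore, final int cacheSize) {
        this.backingStore = backingStore;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.hotCache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > cacheSize; // evict the least recently used entry
            }
        };
    }

    void put(K key, V value) {
        backingStore.put(key, value); // write-through to the slow store
        hotCache.put(key, value);
    }

    V get(K key) {
        V value = hotCache.get(key);
        if (value == null) {
            value = backingStore.get(key); // cache miss: read from the slow store
            backingReads++;
            if (value != null) {
                hotCache.put(key, value);  // promote the record to the hot set
            }
        }
        return value;
    }

    int backingReads() {
        return backingReads;
    }

    public static void main(String[] args) {
        HybridStore<String, Integer> store =
                new HybridStore<String, Integer>(new HashMap<String, Integer>(), 2);
        store.put("a", 1);
        store.put("b", 2);
        store.put("c", 3); // "a" is evicted from the cache but stays in the store
        System.out.println(store.get("a") + " after " + store.backingReads() + " slow read(s)");
    }
}
```

Frequently accessed records are served from memory, while the full dataset only needs to fit in the backing store.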

Algorithmic Enhancements

The new version adds support for L1, L2 and ElasticNet regularization in the SoftMaxRegression (Multinomial Logistic Regression), OrdinalRegression and NLMS (Linear Regression) models. This means that by using the same standard classes one can perform Ridge Regression, Lasso Regression or make use of Elastic Nets. Moreover in the new version the Collaborative Filtering algorithm was modified to support more generic User-user CF models. Finally the NgramsExtractor algorithm was rewritten to make it able to export more keywords and provide better scores.
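The relationship between the three regularization options can be sketched in a few lines of Java. The class ElasticNetPenalty, its penalty method and the parameter names l1 and l2 are hypothetical and do not come from the framework; the parameterization (a plain weighted sum of the L1 norm and the squared L2 norm) is an assumption, and the framework may scale the terms differently.

```java
// Hypothetical helper; it is not part of the Datumbox API.
class ElasticNetPenalty {

    // Returns l1 * sum(|w_i|) + l2 * sum(w_i^2) for the given weights.
    // l2 == 0 gives the Lasso (L1) penalty, l1 == 0 gives the Ridge (L2) penalty,
    // and non-zero values for both give an Elastic Net penalty.
    static double penalty(double[] weights, double l1, double l2) {
        double absSum = 0.0;
        double squaredSum = 0.0;
        for (double w : weights) {
            absSum += Math.abs(w);
            squaredSum += w * w;
        }
        return l1 * absSum + l2 * squaredSum;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -2.0, 3.0};
        System.out.println("lasso:       " + penalty(weights, 1.0, 0.0));
        System.out.println("ridge:       " + penalty(weights, 0.0, 1.0));
        System.out.println("elastic net: " + penalty(weights, 0.5, 0.1));
    }
}
```

During training, a term of this kind is added to the loss, so a single model class can cover all three cases by changing only the two coefficients.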

Framework Architecture & Code Enhancements

Another important update on the new framework is the fact that the project was split into several sub-modules. Below I list the currently supported modules named after their artifactIds:

  1. datumbox-framework-common: It contains the most important interfaces, helper and utility classes, data structures and mechanisms of the framework. This module does not contain any algorithms but it is the base of the framework.
  2. datumbox-framework-core: It includes the three main layers of the framework (Machine Learning, Statistics and Mathematics) along with the utilities layer. This module contains all the algorithms, methods and statistical tests of the framework.
  3. datumbox-framework-applications: It contains a list of classes which are built to provide off-the-shelf solutions for common machine learning problems such as Text Classification, Data Modelling etc. All the classes of the module are built on top of the core module.
  4. datumbox-framework-lib: This is the Datumbox Machine Learning Framework! Note that the artifactId of the library changed from “datumbox-framework” to “datumbox-framework-lib” due to the restructuring.

In addition to the above modules, we have the “datumbox-framework” parent module which is no longer the Java library but simply groups together all the sub-modules under the same project. In order to use the new framework in Maven projects, add the following lines to your pom.xml:

<dependencies>
   ...
   <dependency>
       <groupId>com.datumbox</groupId>
       <artifactId>datumbox-framework-lib</artifactId>
       <version>0.7.0</version>
   </dependency>
   ...
</dependencies>

The new version brings major changes to the structure of the framework, the interfaces and the inheritance, with the main goal of simplifying and improving its architecture. One of the breaking changes introduced in the new framework is the deprecation of the old Dataset class (which was used to store all the training and testing data in the framework) and the introduction of the Dataframe class. The Dataframe class implements the Collection interface, allows the modification and deletion of records and enables the processing of the records in parallel. Another important change is the fact that the BaseMLrecommender, which is the base class for all Recommender System algorithms, now inherits from BaseMLmodel.
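What implementing the Collection interface buys in practice can be shown with standard Java 8 calls. The sketch below deliberately uses a plain ArrayList of numbers as a stand-in and does not touch the Dataframe class itself; CollectionOpsDemo and cleanAndSum are hypothetical names, and whether Dataframe supports each of these default methods in the same way is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical demo; a plain ArrayList stands in for a collection of records.
class CollectionOpsDemo {

    // Deletes the negative records in place, then sums the remaining ones
    // with a parallel stream. Both calls are defined on java.util.Collection.
    static double cleanAndSum(Collection<Double> records) {
        records.removeIf(value -> value < 0);
        return records.parallelStream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        List<Double> records = new ArrayList<Double>(Arrays.asList(1.0, -2.0, 3.0, 4.5));
        System.out.println(cleanAndSum(records) + " from " + records.size() + " records");
    }
}
```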

In addition to the above changes, the framework includes some code improvements and bug fixes: a serialVersionUID was added to every serializable class, the Exceptions and error messages have been improved, and so have the javadocs and the test coverage. For more information about the updates of the new version have a look at the Changelog.

Datumbox 0.7.0 has completed several important milestones of the originally proposed roadmap. The development of the framework will continue in the coming months to cover the following targets:

  1. Access the Framework via Console or Python: The framework should become more accessible to non-Java developers. To achieve this it should provide access to the algorithms via the command line or by offering an API in other languages like Python.
  2. New Machine Learning algorithms: As the architecture of the framework becomes more mature, it will be easier to increase the number of supported algorithms and include models such as Mixture of Gaussians, Gaussian Processes, k-NN, Decision Trees, Random Forests, Factor Analysis, SVD, Factorization Machines, Artificial Neural Networks etc.
  3. More Storage Engines: More options should be provided to the users of the framework to store their models and train their algorithms without loading all the data in memory. Moreover better tools should be provided to those who want to move a model from one storage engine to another.
  4. Improve Documentation, Test coverage & Code examples: Even though the javadocs and test coverage improve in each release, the documentation of the framework is still poor. Future versions should provide better documentation, better test coverage and more examples on how to use the supported algorithms.

Given that I have a full-time job, I expect that the development of the framework will continue at the same rate, releasing a new version every 4-6 months. If you would like to propose a new milestone feel free to open an issue on the official GitHub repository. Last but not least, if you use the project please consider contributing. It doesn’t matter if you are a ninja Java Developer, a rock-star Data Scientist or a power user of the library; I can use all the help I can get so feel free to get in touch with me.

Once again I would like to thank my friend and colleague Eleftherios Bampaletakis for helping me improve the architecture of the framework; his feedback was invaluable. Also I would like to thank Jan Kotek for offering free consulting on how to use MapDB efficiently and for open-sourcing such an amazing product. Moreover lots of thanks to ej-technologies GmbH and JetBrains for providing licenses for their amazing tools JProfiler and IntelliJ IDEA; they both offer amazing products that helped the development of the framework a lot. Last but not least, I would like to thank the love of my life, Kyriaki, for supporting and putting up with me while writing the project.

Don’t forget to clone the code of Datumbox v0.7.0 from GitHub. The library is also available on the Maven Central Repository. Also have a look at the Detailed Installation Guide and at the Code Examples to find out more on how to use the framework.

I’m looking forward to your comments and recommendations. Pull requests are always welcome! 🙂
