publications | Machine Learning and Data Engineering

Trainable Bitwise Soft Quantization for Input Feature Compression

K. Schrödter, J. Stenkamp, N. Herrmann, and F. Gieseke

CPAL26 Third Conference on Parsimony and Learning (CPAL) 2026

The growing demand for machine learning applications in the context of the Internet of Things calls for new approaches to optimize the use of limited compute and memory resources. Despite significant progress that has been made w.r.t. reducing model sizes and improving efficiency, many applications still require remote servers to provide the required resources. However, such approaches rely on transmitting data from edge devices to remote servers, which may not always be feasible due to bandwidth, latency, or energy constraints. We propose a task-specific, trainable feature quantization layer that compresses the input features of a neural network. This can significantly reduce the amount of data that needs to be transferred from the device to a remote server. In particular, the layer allows each input feature to be quantized to a user-defined number of bits, enabling a simple on-device compression at the time of data collection. The layer is designed to approximate step functions with sigmoids, enabling trainable quantization thresholds. By concatenating outputs from multiple sigmoids, introduced as bitwise soft quantization, it achieves trainable quantized values when integrated with a neural network. We compare our method to full-precision inference as well as to several quantization baselines. Experiments show that our approach outperforms standard quantization methods, while maintaining accuracy levels close to those of full-precision models. In particular, depending on the dataset at hand, compression factors of to can be achieved without significant performance loss.
@inproceedings{schroedter2026bitwisequant,
  title = {Trainable Bitwise Soft Quantization for Input Feature Compression},
  author = {Schrödter, Karsten and Stenkamp, Jan and Herrmann, Nina and Gieseke, Fabian},
  booktitle = {Third Conference on Parsimony and Learning (CPAL)},
  year = {2026},
  tags = {ml,application,rs},
  projects = {ai4forest}
}
Counting Parked Bicycles on the Edge - A TinyML Smart City Application

J. Stenkamp, M. Hunke, C. Karatas, S. Kirchhoff, C. Knaden, P. Naebers, L. Zhao, B. Karic, F. Gieseke, and N. Herrmann

SENSYS26 ACM/IEEE International Conference on Embedded Artificial Intelligence and Sensing Systems (SenSys) 2026

As cities strive to reduce car dependency and promote sustainable transportation, encouraging bicycle usage becomes a vital part of the urban planning process. The existence of a sufficient number of bicycle storage facilities is a key building block, as it reduces the likelihood of bicycle theft and the necessity for bicycle repairs. By monitoring the utilization of bicycle parking lots, supply shortfalls can be detected, and users can be informed about the availability of slots. However, detection systems face multiple challenges. Equipping every parking slot with individual sensors is costly, and transmitting visual data can raise privacy concerns or even discourage users. To address this problem, embedded machine learning can be used to process visual data locally and transmit only the resulting count to a central server. This work sets out a real-world use case for microcontrollers that are equipped with a camera and an embedded machine learning model for the purpose of counting parked bicycles. A custom dataset was collected and labeled to train an object-detection model, which was subsequently compressed and deployed on an ESP32-S3 microcontroller that processes the image data locally and transmits only the bicycle count to a remote server via LoRaWAN. The model compression incurs only a marginal performance degradation, with the compressed model still achieving an AP@50 of 0.91. Hence, our approach demonstrates the practical realization of recent theoretical advances in tiny machine learning and provides a viable solution for monitoring bicycle parking facilities in real-world settings.
@inproceedings{Stenkampcountingparkedbicycles,
  title = {Counting Parked Bicycles on the Edge - A TinyML Smart City Application},
  author = {Stenkamp, Jan and Hunke, Mathis and Karatas, Cem and Kirchhoff, Steffen and Knaden, Christoph and Naebers, Paul and Zhao, Lige and Karic, Benjamin and Gieseke, Fabian and Herrmann, Nina},
  booktitle = {ACM/IEEE International Conference on Embedded Artificial Intelligence and Sensing Systems (SenSys)},
  year = {2026},
  tags = {ml,application,rs},
  projects = {tinyaiot}
}
Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices

N. Herrmann, J. Stenkamp, B. Karic, S. Oehmcke, and F. Gieseke

ICLR26 The Fourteenth International Conference on Learning Representations (ICLR) 2026

Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that models achieved the same performance with a compression ratio of 4–16x compared to LightGBM models using an adapted training process and an alternative memory layout. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
@inproceedings{treesonadiet,
  title = {Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices},
  author = {Herrmann, Nina and Stenkamp, Jan and Karic, Benjamin and Oehmcke, Stefan and Gieseke, Fabian},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year = {2026},
  tags = {ml,de,energy,rs},
  projects = {tinyaiot},
  url = {https://openreview.net/forum?id=batDcksZsh}
}
Canopy Tree Height Estimation using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing

K. Schrödter, J. Pauls, and F. Gieseke

AISTATS26 Twenty-Ninth Annual Conference on Artificial Intelligence and Statistics (AISTATS) 2026

Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches to tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that with minor modifications to the prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.
@inproceedings{schroedter2026uncertaintytree,
  title = {Canopy Tree Height Estimation using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing},
  author = {Schrödter, Karsten and Pauls, Jan and Gieseke, Fabian},
  booktitle = {Twenty-Ninth Annual Conference on Artificial Intelligence and Statistics (AISTATS)},
  year = {2026},
  tags = {ml,application,rs},
  projects = {ai4forest}
}
Retrieving yearly forest growth from satellite data: A deep learning based approach

M. Schwartz, P. Ciais, E. Sean, A. de Truchis, C. Vega, N. Besic, I. Fayad, J. Wigneron, S. Brood, A. Pelissier-Tanon, J. Pauls, G. Belouze, and Y. Xu

Remote Sensing of Environment 2025

High-resolution mapping of forest attributes is crucial for ecosystem monitoring and carbon budget assessments. Recent advancements have leveraged satellite imagery and deep learning algorithms to generate high-resolution forest height maps. While these maps provide valuable snapshots of forest conditions, they lack the temporal resolution to estimate forest-related carbon fluxes or track annual changes. Few studies have produced annual forest height, volume, or biomass change maps validated at the forest stand level. To address this limitation, we developed a deep learning framework, coupling data from Sentinel-1 (S1), Sentinel-2 (S2) and from the Global Ecosystem Dynamics Investigation (GEDI) mission, to generate a time series of forest height, growing stock volume, and aboveground biomass at 10 to 30-m spatial resolution that we refer to as FORMS-T (FORest Multiple Satellite Time series). Unlike previous studies, we train our model on individual S2 scenes, rather than on growing season composites, to account for acquisition variability and improve generalization across years. We produced these maps for France over seven years (2018–2024) for height at 10 m resolution and further converted them to 30 m maps of growing stock volume and aboveground biomass using leaf type-specific allometric equations. Evaluation against the French National Forest Inventory (NFI) showed an average mean absolute error of 3.07 m for height (r²=0.68) across all years, 86 m³ ha⁻¹ for volume and 65.1 Mg ha⁻¹ for biomass. We further evaluated FORMS-T capacity to capture growth on a site where two successive airborne laser scanning (ALS) campaigns were available, showing a good agreement with ALS data when aggregating at coarser spatial resolution (r²=0.60, MAE=0.27 m for the 2020–2022 growth of trees between 10 and 15 m in 5 km pixels). Additionally, we compared our results to the NFI-based wood volume production at regional level and obtained a good agreement with a MAE of 1.45 m³ ha⁻¹ yr⁻¹ and r² of 0.59. We then leveraged our height change maps to derive species-specific growth curves and compared them to ground-based measurements, highlighting distinct growth dynamics and regional variations in forest management practices. Further development of such maps could contribute to the assessment of forest-related carbon stocks and fluxes, contributing to the formulation of a comprehensive carbon budget at the country scale, and supporting global efforts to mitigate climate change.
DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications

I. Fayad, M. Zimmer, M. Schwartz, P. Ciais, F. Gieseke, G. Belouze, S. Brood, A. De Truchis, and A. d’Aspremont

ICML25 42nd International Conference on Machine Learning (ICML) 2025

Significant efforts have been directed towards adapting self-supervised multimodal learning for Earth observation applications. However, existing methods produce coarse patch-sized embeddings, limiting their effectiveness and integration with other modalities like LiDAR. To close this gap, we present DUNIA, an approach to learn pixel-sized embeddings through cross-modal alignment between images and full-waveform LiDAR data. As the model is trained in a contrastive manner, the embeddings can be directly leveraged in the context of a variety of environmental monitoring tasks in a zero-shot setting. In our experiments, we demonstrate the effectiveness of the embeddings for seven such tasks (canopy height mapping, fractional canopy cover, land cover mapping, tree species identification, plant area index, crop type classification, and per-pixel waveform-based vertical structure mapping). The results show that the embeddings, along with zero-shot classifiers, often outperform specialized supervised models, even in low data regimes. In the fine-tuning setting, we show strong low-shot capabilities with performances near or better than state-of-the-art on five out of six tasks.
@inproceedings{fayad2025dunia,
  title = {DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications},
  author = {Fayad, Ibrahim and Zimmer, Max and Schwartz, Martin and Ciais, Philippe and Gieseke, Fabian and Belouze, Gabriel and Brood, Sarah and De Truchis, Aurelien and d'Aspremont, Alexandre},
  booktitle = {42nd International Conference on Machine Learning (ICML)},
  year = {2025},
  tags = {ml,application},
  projects = {ai4forest}
}
Capturing Temporal Dynamics in Large-Scale Canopy Tree Height Estimation

J. Pauls, M. Zimmer, B. Turan, S. Saatchi, P. Ciais, S. Pokutta, and F. Gieseke

ICML25 42nd International Conference on Machine Learning (ICML) 2025

With the rise in global greenhouse gas emissions, accurate large-scale tree canopy height maps are essential for understanding forest structure, estimating above-ground biomass, and monitoring ecological disruptions. To this end, we present a novel approach to generate large-scale, high-resolution canopy height maps over time. Our model accurately predicts canopy height over multiple years given Sentinel-2 time series satellite data. Using GEDI LiDAR data as the ground truth for training the model, we present the first 10m resolution temporal canopy height map of the European continent for the period 2019-2022. As part of this product, we also offer a detailed canopy height map for 2020, providing more precise estimates than previous studies. Our pipeline and the resulting temporal height map are publicly available, enabling comprehensive large-scale monitoring of forests and, hence, facilitating future research and ecological analyses. For an interactive viewer, see https://europetreemap.projects.earthengine.app/view/temporalcanopyheight.
@inproceedings{pauls2025capturing,
  title = {Capturing Temporal Dynamics in Large-Scale Canopy Tree Height Estimation},
  author = {Pauls, Jan and Zimmer, Max and Turan, Berkant and Saatchi, Sassan and Ciais, Philippe and Pokutta, Sebastian and Gieseke, Fabian},
  booktitle = {42nd International Conference on Machine Learning (ICML)},
  year = {2025},
  custom = {Earth Engine|https://europetreemap.projects.earthengine.app/view/temporalcanopyheight},
  tags = {ml, application},
  projects = {ai4forest}
}
Predictability of abrupt shifts in dryland ecosystem functioning

P. N. Bernardino, W. D. Keersmaecker, S. Horion, R. V. D. Kerchove, S. Lhermitte, R. Fensholt, S. Oehmcke, F. Gieseke, K. V. Meerbeek, C. Abel, J. Verbesselt, and B. Somers

JOURNAL Nature Climate Change 2025

Climate change and human-induced land degradation threaten dryland ecosystems, vital to one-third of the global population and pivotal to inter-annual global carbon fluxes. Early warning systems are essential for guiding conservation, climate change mitigation and alleviating food insecurity in drylands. However, contemporary methods fail to provide large-scale early warnings effectively. Here we show that a machine learning-based approach can predict the probability of abrupt shifts in Sudano–Sahelian dryland vegetation functioning (75.1% accuracy; 76.6% precision) particularly where measures of resilience (temporal autocorrelation) are supplemented with proxies for vegetation and rainfall dynamics and other environmental factors. Regional-scale predictions for 2025 highlight a belt in the south of the study region with high probabilities of future shifts, largely linked to long-term rainfall trends. Our approach can provide valuable support for the conservation and sustainable use of dryland ecosystem services, particularly in the context of climate change projected drying trends.
@article{BernardinoKHKLFOGMAVS2025,
  author = {Bernardino, Paulo Negri and Keersmaecker, Wanda De and Horion, Stéphanie and Kerchove, Ruben Van De and Lhermitte, Stef and Fensholt, Rasmus and Oehmcke, Stefan and Gieseke, Fabian and Meerbeek, Koenraad Van and Abel, Christin and Verbesselt, Jan and Somers, Ben},
  title = {Predictability of abrupt shifts in dryland ecosystem functioning},
  journal = {Nature Climate Change},
  year = {2025},
  volume = {15},
  pages = {86--91},
  doi = {10.1038/s41558-024-02201-0},
  tags = {application,rs},
}
High-resolution sensors and deep learning models for tree resource monitoring

M. Brandt, J. Chave, S. Li, R. Fensholt, P. Ciais, J. Wigneron, F. Gieseke, S. Saatchi, C. J. Tucker, and C. Igel

JOURNAL Nature Reviews Electrical Engineering 2025

Trees contribute to carbon dioxide absorption through biomass, regulate the climate, support biodiversity, enhance soil, air and water quality, and offer economic and health benefits. Traditionally, tree monitoring on continental and global scales has focused on forest cover, whereas assessing biomass and species diversity, as well as trees outside closed-canopy forests, has been challenging. A new generation of commercial and public satellites and sensors provide high-resolution spatial and temporal optical data that can be used to identify trees as objects. Technologies from the field of artificial intelligence, such as convolutional neural networks and vision transformers, can go beyond detecting these objects as two-dimensional representations, and support characterization of the three-dimensional structure of objects, such as canopy height and wood volume, via contextual learning from two-dimensional images. These advancements enable reliable characterization of trees, their structure, biomass and diversity both inside and outside forests. Furthermore, self-supervision and foundation models facilitate large-scale applications without requiring extensive amounts of labels. Here, we summarize these advances, highlighting their application towards consistent tree monitoring systems that can assess carbon stocks, attribute losses and gains to underlying drivers and, ultimately, contribute to climate change mitigation.
@article{BrandtCLFCWGSTI2025HighResolution,
  author = {Brandt, Martin and Chave, Jerome and Li, Sizhuo and Fensholt, Rasmus and Ciais, Philippe and Wigneron, Jean-Pierre and Gieseke, Fabian and Saatchi, Sassan and Tucker, C. J. and Igel, Christian},
  title = {High-resolution sensors and deep learning models for tree resource monitoring},
  journal = {Nature Reviews Electrical Engineering},
  year = {2025},
  volume = {2},
  pages = {13--26},
  doi = {10.1038/s44287-024-00116-8},
  tags = {application,rs},
}
Estimating Canopy Height at Scale

J. Pauls, M. Zimmer, U. M. Kelly, M. Schwartz, S. Saatchi, P. Ciais, S. Pokutta, M. Brandt, and F. Gieseke

ICML24 41st International Conference on Machine Learning (ICML) 2024

We propose a framework for global-scale canopy height estimation based on satellite data. Our model leverages advanced data preprocessing techniques, resorts to a novel loss function designed to counter geolocation inaccuracies inherent in the ground-truth height measurements, and employs data from the Shuttle Radar Topography Mission to effectively filter out erroneous labels in mountainous regions, enhancing the reliability of our predictions in those areas. A comparison between predictions and ground-truth labels yields an MAE / RMSE of 2.43 / 4.73 (meters) overall and 4.45 / 6.72 (meters) for trees taller than five meters, which depicts a substantial improvement compared to existing global-scale maps. The resulting height map as well as the underlying framework will facilitate and enhance ecological analyses at a global scale, including, but not limited to, large-scale forest and biomass monitoring.
@inproceedings{pauls2024estimating,
  title = {Estimating Canopy Height at Scale},
  author = {Pauls, Jan and Zimmer, Max and Kelly, Una M. and Schwartz, Martin and Saatchi, Sassan and Ciais, Philippe and Pokutta, Sebastian and Brandt, Martin and Gieseke, Fabian},
  booktitle = {41st International Conference on Machine Learning (ICML)},
  year = {2024},
  custom = {Earth Engine|https://worldwidemap.projects.earthengine.app/view/canopy-height-2020},
  tags = {ml,de,application},
  projects = {ai4forest}
}
CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval

C. Lülf, D. M. Lima Martins, S. M. A. Vaz, Y. Zhou, and F. Gieseke

SIGIR24 Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Demo Track) 2024

The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of images and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive and negative examples. Our framework involves training a classification model given the additional user feedback and essentially outputs all positively classified instances of the entire data catalog. By building upon recent techniques, this inference phase, however, is not implemented by scanning the entire data catalog, but by employing efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times.
@inproceedings{LuelfLMVZG2024CLIPBranches,
  author = {Lülf, Christian and {Lima Martins}, Denis Mayr and Vaz, Salles Marcos Antonio and Zhou, Yongluan and Gieseke, Fabian},
  title = {CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Demo Track)},
  year = {2024},
  address = {Washington, D.C.},
  tags = {ml,de},
}
Deep point cloud regression for above-ground forest biomass estimation from airborne LiDAR

S. Oehmcke, L. Li, K. Trepekli, J. C. Revenga, T. Nord-Larsen, F. Gieseke, and C. Igel

JOURNAL Remote Sensing of Environment 2024

Quantifying forest biomass stocks and their dynamics is important for implementing effective climate change mitigation measures by aiding local forest management, studying processes driving af-, re-, and deforestation, and improving the accuracy of carbon accounting. Owing to the 3-dimensional nature of forest structure, remote sensing using airborne LiDAR can be used to perform these measurements of vegetation structure at large scale. Harnessing the full dimensionality of the data, we present deep learning systems predicting wood volume and above-ground biomass (AGB) directly from the full LiDAR point cloud and compare results to state-of-the-art approaches operating on basic statistics of the point clouds. For this purpose, we devise different neural network architectures for point cloud regression and evaluate them on remote sensing data of areas for which AGB estimates have been obtained from field measurements in the Danish national forest inventory. Our adaptation of Minkowski convolutional neural networks for regression gives the best results. The deep neural networks produce significantly more accurate wood volume, AGB, and carbon stock estimates compared to state-of-the-art approaches. In contrast to other methods, the proposed deep learning approach does not require a digital terrain model and is robust to artifacts along the boundaries of the evaluated areas, which we demonstrate for the case where trees protrude into the area from the outside. We expect this finding to have a strong impact on LiDAR-based analyses of biomass dynamics.
@article{OehmckeLTRNLGI2024DeepPointCloud,
  author = {Oehmcke, Stefan and Li, Lei and Trepekli, Katerina and Revenga, Jaime C. and Nord-Larsen, Thomas and Gieseke, Fabian and Igel, Christian},
  title = {Deep point cloud regression for above-ground forest biomass estimation from airborne LiDAR},
  journal = {Remote Sensing of Environment},
  year = {2024},
  volume = {302},
  doi = {10.1016/j.rse.2023.113968},
  tags = {rs,ml,application}
}
Optimizing Three-Dimensional Stencil-Operations on Heterogeneous Computing Environments

N. Herrmann, J. Dieckmann, and H. Kuchen

International Journal of Parallel Programming 2024

Complex algorithms and enormous data sets require parallel execution of programs to attain results in a reasonable amount of time. Both aspects are combined in the domain of three-dimensional stencil operations, for example, computational fluid dynamics. This work contributes to the research on high-level parallel programming by discussing the generalizable implementation of a three-dimensional stencil skeleton that works in heterogeneous computing environments. Two exemplary programs, a gas simulation with the Lattice Boltzmann method, and a mean blur, are executed in a multi-node multi-graphics processing units environment, proving the runtime improvements in heterogeneous computing environments compared to a sequential program.
End-to-End Neural Network Training for Hyperbox-Based Classification

D. L. M. Martins, C. Lülf, and F. Gieseke

ESANN23 31st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning 2023

Hyperbox-based classification has been seen as a promising technique in which decisions on the data are represented as a series of orthogonal, multidimensional boxes (i.e., hyperboxes) that are often interpretable and human-readable. However, existing methods are no longer capable of efficiently handling the increasing volume of data many application domains face nowadays. We address this gap by proposing a novel, fully differentiable framework for hyperbox-based classification via neural networks. In contrast to previous work, our hyperbox models can be efficiently trained in an end-to-end fashion, which leads to significantly reduced training times and superior classification results.
@inproceedings{LimaDCF2023EndToEndNeural,
  author = {Martins, Denis Lima Mayr and Lülf, Christian and Gieseke, Fabian},
  title = {End-to-End Neural Network Training for Hyperbox-Based Classification},
  booktitle = {31st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning},
  year = {2023},
  address = {Brügge},
  tags = {ml}
}
Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

C. Lülf, D. M. Lima Martins, S. M. A. Vaz, Y. Zhou, and F. Gieseke

VLDB23 Proceedings of the VLDB Endowment 2023

The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find “interesting” objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.
@inproceedings{LuelfLMVZG2023FastSearchByClassification,
  author = {Lülf, Christian and {Lima Martins}, Denis Mayr and Vaz, Salles Marcos Antonio and Zhou, Yongluan and Gieseke, Fabian},
  title = {Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests},
  booktitle = {Proceedings of the VLDB Endowment},
  pages = {2845--2857},
  volume = {16},
  editor = {VLDB, Endowment},
  year = {2023},
  publisher = {ACM Press},
  address = {Vancouver},
  issn = {2150-8097},
  doi = {10.14778/3611479.3611492},
  tags = {ml,de}
}
RapidEarth: A Search Engine for Large-Scale Geospatial Imagery

C. Lülf, D. M. Lima Martins, S. M. A. Vaz, Y. Zhou, and F. Gieseke

SIGSPATIAL23 Proceedings of the 31st International Conference on Advances in Geographic Information Systems, SIGSPATIAL, Demo Paper (Best Demo Award) 2023

Data exploration and analysis in various domains often necessitate the search for specific objects in massive databases. A common search strategy, often known as search-by-classification, resorts to training machine learning models on small sets of positive and negative samples and to performing inference on the entire database to discover additional objects of interest. While such an approach often yields very good results in terms of classification performance, the entire database usually needs to be scanned, a process that can easily take several hours even for medium-sized data catalogs. In this work, we present RapidEarth, a geospatial search-by-classification engine that allows analysts to rapidly search for interesting objects in very large data collections of satellite imagery in a matter of seconds, without the need to scan the entire data catalog. RapidEarth embodies a co-design of multidimensional indexing structures and decision branches, a recently proposed variant of classical decision trees. These decision branches allow RapidEarth to transform the inference phase into a set of range queries, which can be efficiently processed by leveraging the aforementioned multidimensional indexing structures. The main contribution of this work is a geospatial search engine that implements these technical findings.
Deep learning enables image-based tree counting, crown segmentation, and height prediction at national scale

S. Li, M. Brandt, R. Fensholt, A. Kariryaa, C. Igel, F. Gieseke, T. Nord-Larsen, S. Oehmcke, A. H. Carlsen, S. Junttila, X. Tong, A. d’Aspremont, and P. Ciais

JOURNAL PNAS Nexus 2023

Sustainable tree resource management is the key to mitigating climate warming, fostering a green economy, and protecting valuable habitats. Detailed knowledge about tree resources is a prerequisite for such management but is conventionally based on plot-scale data, which often neglects trees outside forests. Here, we present a deep learning-based framework that provides location, crown area, and height for individual overstory trees from aerial images at country scale. We apply the framework on data covering Denmark and show that large trees (stem diameter >10 cm) can be identified with a low bias (12.5%) and that trees outside forests contribute to 30% of the total tree cover, which is typically unrecognized in national inventories. The bias is high (46.6%) when our results are evaluated against all trees taller than 1.3 m, which involve undetectable small or understory trees. Furthermore, we demonstrate that only marginal effort is needed to transfer our framework to data from Finland, despite markedly dissimilar data sources. Our work lays the foundation for digitalized national databases, where large trees are spatially traceable and manageable.
@article{LiSMRACFTSASXAP2023DeepLearning, author = {Li, Sizhuo and Brandt, Martin and Fensholt, Rasmus and Kariryaa, Ankit and Igel, Christian and Gieseke, Fabian and Nord-Larsen, Thomas and Oehmcke, Stefan and Carlsen, Ask Holm and Junttila, Samuli and Tong, Xiaoye and d’Aspremont, Alexandre and Ciais, Philippe}, title = {Deep learning enables image-based tree counting, crown segmentation, and height prediction at national scale}, journal = {PNAS Nexus}, year = {2023}, volume = {2}, number = {4}, doi = {10.1093/pnasnexus/pgad076}, tags = {rs,application,ml} }
More than one quarter of Africa’s tree cover is found outside areas previously classified as forest

F. Reiner, M. Brandt, X. Tong, D. Skole, A. Kariryaa, P. Ciais, A. Davies, P. Hiernaux, J. Chave, M. Mugabowindekwe, C. Igel, S. Oehmcke, F. Gieseke, S. Li, S. Liu, S. S. Saatchi, P. Boucher, J. Singh, S. Taugourdeau, M. Dendoncker, X. Song, O. Mertz, C. Tucker, and R. Fensholt

JOURNAL Nature Communications 2023

The consistent monitoring of trees both inside and outside of forests is key to sustainable land management. Current monitoring systems either ignore trees outside forests or are too expensive to be applied consistently across countries on a repeated basis. Here we use the PlanetScope nanosatellite constellation, which delivers global very high-resolution daily imagery, to map both forest and non-forest tree cover for continental Africa using images from a single year. Our prototype map of 2019 (RMSE = 9.57%, bias = −6.9%) demonstrates that a precise assessment of all tree-based ecosystems is possible at continental scale, and reveals that 29% of tree cover is found outside areas previously classified as tree cover in state-of-the-art maps, such as in croplands and grassland. Such accurate mapping of tree cover down to the level of individual trees and consistent among countries has the potential to redefine land use impacts in non-forest landscapes, move beyond the need for forest definitions, and build the basis for natural climate solutions and tree-related studies.
@article{ReinerBTSKCDHCMIOGLLSBSTDSMTF2023MoreThanOne, author = {Reiner, Florian and Brandt, Martin and Tong, Xiaoye and Skole, David and Kariryaa, Ankit and Ciais, Philippe and Davies, Andrew and Hiernaux, Pierre and Chave, Jerome and Mugabowindekwe, Maurice and Igel, Christian and Oehmcke, Stefan and Gieseke, Fabian and Li, Sizhuo and Liu, Siyu and Saatchi, Sassan S and Boucher, Peter and Singh, Jenia and Taugourdeau, Simon and Dendoncker, Morgane and Song, Xiao-Peng and Mertz, Ole and Tucker, Compton and Fensholt, Rasmus}, title = {More than one quarter of Africa's tree cover is found outside areas previously classified as forest}, journal = {Nature Communications}, year = {2023}, tags = {rs,application,ml} }
Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments

N. Herrmann and H. Kuchen

International Journal of Parallel Programming 2023

Contemporary HPC hardware typically provides several levels of parallelism, e.g. multiple nodes, each having multiple cores (possibly with vectorization) and accelerators. Efficiently programming such systems usually requires skills in combining several low-level frameworks such as MPI, OpenMP, and CUDA. This overburdens programmers without substantial parallel programming skills. One way to overcome this problem and to abstract from details of parallel programming is to use algorithmic skeletons. In the present paper, we evaluate the multi-node, multi-CPU and multi-GPU implementation of the most essential skeletons Map, Reduce, and Zip. Our main contribution is a discussion of the efficiency of using multiple parallelization levels and the consideration of which fine-tune settings should be offered to the user.
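As a rough illustration of what the Map, Zip, and Reduce skeletons express, here is a sequential Python sketch. The function names are hypothetical and do not correspond to the API of the library evaluated in the paper; a real skeleton library would hide a distributed, multi-GPU implementation behind the same kind of interface.

```python
from functools import reduce
import operator


def map_skeleton(f, xs):
    """Apply f independently to every element (trivially parallelizable)."""
    return [f(x) for x in xs]


def zip_skeleton(f, xs, ys):
    """Combine two equally long collections element by element."""
    if len(xs) != len(ys):
        raise ValueError("zip skeleton expects collections of equal length")
    return [f(x, y) for x, y in zip(xs, ys)]


def reduce_skeleton(f, xs, identity):
    """Fold a collection with an associative operator f."""
    return reduce(f, xs, identity)


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

squares = map_skeleton(lambda x: x * x, a)
# A dot product written as Zip followed by Reduce.
dot = reduce_skeleton(operator.add, zip_skeleton(operator.mul, a, b), 0.0)
```

Because the user only supplies the element-wise function and the operator, the library is free to decide how the work is split across nodes, cores, and accelerators.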
Input Selection for Bandwidth-Limited Neural Network Inference

S. Oehmcke and F. Gieseke

SDM22 Proceedings of the 2022 SIAM International Conference on Data Mining (SDM) 2022

Data are often accommodated on centralized storage servers. This is the case, for instance, in remote sensing and astronomy, where projects produce several petabytes of data every year. While machine learning models are often trained on relatively small subsets of the data, the inference phase typically requires transferring significant amounts of data between the servers and the clients. In many cases, the bandwidth available per user is limited, which then renders the data transfer to be one of the major bottlenecks. In this work, we propose a framework that automatically selects the relevant parts of the input data for a given neural network. The model as well as the associated selection masks are trained simultaneously such that a good model performance is achieved while only a minimal amount of data is selected. During the inference phase, only those parts of the data have to be transferred between the server and the client. We propose both instance-independent and instance-dependent selection masks. The former ones are the same for all instances to be transferred, whereas the latter ones allow for variable transfer sizes per instance. Our experiments show that it is often possible to significantly reduce the amount of data needed to be transferred without affecting the model quality much.
@inproceedings{OehmckeG2022InputSelection, author = {Oehmcke, Stefan and Gieseke, Fabian}, title = {Input Selection for Bandwidth-Limited Neural Network Inference}, booktitle = {Proceedings of the 2022 SIAM International Conference on Data Mining (SDM)}, pages = {280--288}, editor = {Banerjee, Arindam and Zhou, Zhi-Hua and Papalexakis, Evangelos E. and Riondato, Matteo}, year = {2022}, publisher = {SIAM Publications}, address = {USA}, doi = {10.1137/1.9781611977172.32}, tags = {ml,de}, }
Deep Learning Based 3D Point Cloud Regression for Estimating Forest Biomass

S. Oehmcke, L. Li, J. Revenga, T. Nord-Larsen, K. Trepekli, F. Gieseke, and C. Igel

SIGSPATIAL22 30th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2022) 2022

Quantification of forest biomass stocks and their dynamics is important for implementing effective climate change mitigation measures. The knowledge is needed, e.g., for local forest management, studying the processes driving af-, re-, and deforestation, and can improve the accuracy of carbon-accounting. Remote sensing using airborne LiDAR can be used to perform these measurements of vegetation structure at large scale. We present deep learning systems for predicting wood volume, above-ground biomass (AGB), and subsequently above-ground carbon stocks directly from airborne LiDAR point clouds. We devise different neural network architectures for point cloud regression and evaluate them on remote sensing data of areas for which AGB estimates have been obtained from field measurements in the Danish national forest inventory. Our adaptation of Minkowski convolutional neural networks for regression gave the best results. The deep neural networks produced significantly more accurate wood volume, AGB, and carbon stock estimates compared to state-of-the-art approaches operating on basic statistics of the point clouds. In contrast to other methods, the proposed deep learning approach does not require a digital terrain model. We expect this finding to have a strong impact on LiDAR-based analyses of biomass dynamics.
@inproceedings{OehmckeLRNLTGI2022DeepLearning, author = {Oehmcke, Stefan and Li, Lei and Revenga, Jaime and Nord-Larsen, Thomas and Trepekli, Katerina and Gieseke, Fabian and Igel, Christian}, title = {Deep Learning Based 3D Point Cloud Regression for Estimating Forest Biomass}, booktitle = {30th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2022)}, pages = {1--4}, editor = {Renz, Matthias and Sarwat, Mohamed}, year = {2022}, publisher = {ACM Press}, address = {New York, NY, USA}, doi = {10.1145/3557915.3561471}, tags = {ml,application,rs} }
Using high-resolution imagery and deep learning to classify land-use following deforestation: a case study in Ethiopia

R. N. Masolele, V. De Sy, D. Marcos, J. Verbesselt, F. Gieseke, K. A. Mulatu, Y. Moges, H. Sebrala, C. Martius, and M. Herold

JOURNAL GIScience and Remote Sensing 2022

National-scale assessments of post-deforestation land-use are crucial for decreasing deforestation and forest degradation-related emissions. In this research, we assess the potential of different satellite data modalities (single-date, multi-date, multi-resolution, and an ensemble of multi-sensor images) for classifying land-use following deforestation in Ethiopia using the U-Net deep neural network architecture enhanced with attention. We performed the analysis on satellite image data retrieved across Ethiopia from freely available Landsat-8, Sentinel-2 and Planet-NICFI satellite data. The experiments aimed at an analysis of (a) single-date images from individual sensors to account for the differences in spatial resolution between image sensors in detecting land-uses, (b) ensembles of multiple images from different sensors (Planet-NICFI/Sentinel-2/Landsat-8) with different spatial resolutions, (c) the use of multi-date data to account for the contribution of temporal information in detecting land-uses, and, finally, (d) the identification of regional differences in terms of land-use following deforestation in Ethiopia. We hypothesize that choosing the right satellite imagery (sensor) type is crucial for the task. Based on a comprehensive visually interpreted reference dataset of 11 types of post-deforestation land-uses, we find that either detailed spatial patterns (single-date Planet-NICFI) or detailed temporal patterns (multi-date Sentinel-2, Landsat-8) are required for identifying land-use following deforestation, while medium-resolution single-date imagery is not sufficient to achieve high classification accuracy. We also find that adding soft-attention to the standard U-Net improved the classification accuracy, especially for small-scale land-uses. 
The models and products presented in this work can be used as a powerful data resource for governmental and forest monitoring agencies to design and monitor deforestation mitigation measures and data-driven land-use policy.
@article{MasoleleDMVGMMSMH2022UsingHighResolution, author = {Masolele, Robert N. and {De Sy}, Veronique and Marcos, Diego and Verbesselt, Jan and Gieseke, Fabian and Mulatu, Kalkidan Ayele and Moges, Yitebitu and Sebrala, Heiru and Martius, Christopher and Herold, Martin}, title = {Using high-resolution imagery and deep learning to classify land-use following deforestation: a case study in Ethiopia}, journal = {GIScience and Remote Sensing}, year = {2022}, volume = {59}, number = {1}, pages = {1446--1472}, doi = {10.1080/15481603.2022.2115619}, tags = {application,rs}, }
Nation-wide mapping of tree-level aboveground carbon stocks in Rwanda

M. Mugabowindekwe, M. Brandt, J. Chave, F. Reiner, D. Skole, A. Kariryaa, C. Igel, P. Hiernaux, P. Ciais, O. Mertz, X. Tong, S. Li, G. Rwanyiziri, T. Dushimiyimana, A. Ndoli, U. Valens, J. Lillesø, F. Gieseke, C. Tucker, S. S. Saatchi, and R. Fensholt

JOURNAL Nature Climate Change 2022

Trees sustain livelihoods and mitigate climate change but a predominance of trees outside forests and limited resources make it difficult for many tropical countries to conduct automated nation-wide inventories. Here, we propose an approach to map the carbon stock of each individual overstory tree at the national scale of Rwanda using aerial imagery from 2008 and deep learning. We show that 72% of the mapped trees are located in farmlands and savannas and 17% in plantations, accounting for 48.6% of the national aboveground carbon stocks. Natural forests cover 11% of the total tree count and 51.4% of the national carbon stocks, with an overall carbon stock uncertainty of 16.9%. The mapping of all trees allows partitioning to any landscapes classification and is urgently needed for effective planning and monitoring of restoration activities as well as for optimization of carbon sequestration, biodiversity and economic benefits of trees.
@article{MugabowindekweBCRSKIHCMTLRDNVLGTSF2022NationWide, author = {Mugabowindekwe, Maurice and Brandt, Martin and Chave, Jerome and Reiner, Florian and Skole, David and Kariryaa, Ankit and Igel, Christian and Hiernaux, Pierre and Ciais, Philippe and Mertz, Ole and Tong, Xiaoye and Li, Sizhuo and Rwanyiziri, Gaspard and Dushimiyimana, Thaulin and Ndoli, Alain and Valens, Uwizeyimana and Lillesø, Jens-Peter and Gieseke, Fabian and Tucker, Compton and Saatchi, Sassan S and Fensholt, Rasmus}, title = {Nation-wide mapping of tree-level aboveground carbon stocks in Rwanda}, journal = {Nature Climate Change}, year = {2022}, volume = {13}, doi = {10.1038/s41558-022-01544-w}, tags = {application,rs}, }
Above-Ground Biomass Prediction for Croplands at a Sub-Meter Resolution Using UAV–LiDAR and Machine Learning Methods

J. C. Revenga, K. Trepekli, S. Oehmcke, R. Jensen, L. Li, C. Igel, F. Gieseke, and T. Friborg

JOURNAL Remote Sensing (Remote Sens.) 2022

Current endeavors to enhance the accuracy of in situ above-ground biomass (AGB) prediction for croplands rely on close-range monitoring surveys that use unstaffed aerial vehicles (UAVs) and mounted sensors. In precision agriculture, light detection and ranging (LiDAR) technologies are currently used to monitor crop growth, plant phenotyping, and biomass dynamics at the ecosystem scale. In this study, we utilized a UAV–LiDAR sensor to monitor two crop fields and a set of machine learning (ML) methods to predict real-time AGB over two consecutive years in the region of Mid-Jutland, Denmark. During each crop growing period, UAV surveys were conducted in parallel with AGB destructive sampling every 7–15 days, the AGB samples from which were used as the ground truth data. We evaluated the ability of the ML models to estimate the real-time values of AGB at a sub-meter resolution (0.17–0.52 m²). An extremely randomized trees (ERT) regressor was selected for the regression analysis, based on its predictive performance for the first year’s growing season. The model was retrained using previously identified hyperparameters to predict the AGB of the crops in the second year. The ERT performed AGB estimation using height and reflectance metrics from LiDAR-derived point cloud data and achieved a prediction performance of R² = 0.48 at a spatial resolution of 0.35 m². The prediction performance could be improved significantly by aggregating adjacent predictions (R² = 0.71 and R² = 0.93 at spatial resolutions of 1 m² and 2 m², respectively) as they ultimately converged to the reference biomass values because any individual errors averaged out. The AGB prediction results were examined as a function of predictor type, training set size, sampling resolution, phenology, and canopy density. 
The results demonstrated that when combined with ML regression methods, the UAV–LiDAR method could be used to provide accurate real-time AGB prediction for crop fields at a high resolution, thereby providing a way to map their biochemical constituents.
@article{RevengaTOJLIGF2022AboveGround, author = {Revenga, Jaime C. and Trepekli, Katerina and Oehmcke, Stefan and Jensen, Rasmus and Li, Lei and Igel, Christian and Gieseke, Fabian and Friborg, Thomas}, title = {Above-Ground Biomass Prediction for Croplands at a Sub-Meter Resolution Using UAV–LiDAR and Machine Learning Methods}, journal = {Remote Sensing (Remote Sens.)}, year = {2022}, volume = {14}, number = {16}, pages = {3912}, doi = {10.3390/rs14163912}, tags = {application,rs} }
Stencil calculations with algorithmic skeletons for heterogeneous computing environments

N. Herrmann, B. Menezes, and H. Kuchen

International Journal of Parallel Programming 2022

The development of parallel applications is a difficult and error-prone task, especially for inexperienced programmers. Stencil operations are exceptionally complex for parallelization as synchronization and communication between the individual processes and threads are necessary. It gets even more difficult to efficiently distribute the computations and efficiently implement communication when heterogeneous computing environments are used. For using multiple nodes, each having multiple cores and accelerators such as GPUs, skills in combining frameworks such as MPI, OpenMP, and CUDA are required. The complexity of parallelizing the stencil operation increases the need for abstracting from the platform-specific details and simplifying parallel programming. One way to abstract from details of parallel programming is to use algorithmic skeletons. This work introduces an implementation of the MapStencil skeleton that is able to generate parallel code for distributed memory environments, using multiple nodes with multicore CPUs and GPUs. Examples of practical applications of the MapStencil skeleton are the Jacobi Solver or the Canny Edge Detector. The main contribution of this paper is a discussion of the difficulties when implementing a universal Skeleton for MapStencil for heterogeneous computing environments and an outline of the identified best practices for communication-intense skeletons.
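The Jacobi solver mentioned above is a typical stencil computation. The following single-process NumPy sketch shows the pattern; `map_stencil` and `jacobi_step` are hypothetical names chosen for illustration, not the interface of the skeleton library described in the paper. Each interior cell is replaced by the average of its four neighbours, while the boundary is held fixed.

```python
import numpy as np


def map_stencil(grid, stencil_fn):
    """Apply stencil_fn to the 4-neighbourhood of every interior cell."""
    out = grid.copy()  # boundary cells are carried over unchanged
    out[1:-1, 1:-1] = stencil_fn(
        grid[:-2, 1:-1],   # north neighbours
        grid[2:, 1:-1],    # south neighbours
        grid[1:-1, :-2],   # west neighbours
        grid[1:-1, 2:],    # east neighbours
    )
    return out


def jacobi_step(grid):
    """One Jacobi iteration for the Laplace equation."""
    return map_stencil(grid, lambda n, s, w, e: 0.25 * (n + s + w + e))


grid = np.zeros((5, 5))
grid[0, :] = 1.0  # top boundary held at 1, all other boundaries at 0

for _ in range(200):
    grid = jacobi_step(grid)
```

In a distributed setting each node would own a block of the grid and exchange its border rows with its neighbours before every step, which is the communication that makes stencil skeletons harder to implement than Map or Zip.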
Attentional Feature Fusion

Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard

WACV21 Proceedings of the Workshop on Applications of Computer Vision (WACV) 2021

Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable for most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our codes and trained models are available online.
@inproceedings{DaiGOWB2021Attentional, author = {Dai, Yimian and Gieseke, Fabian and Oehmcke, Stefan and Wu, Yiquan and Barnard, Kobus}, title = {Attentional Feature Fusion}, booktitle = {Proceedings of the Workshop on Applications of Computer Vision (WACV)}, pages = {3559--3568}, year = {2021}, publisher = {IEEE}, doi = {10.1109/WACV48630.2021.00360}, tags = {ml}, }
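A rough sketch of attention-weighted fusion, as opposed to plain summation or concatenation, is given below. It makes several assumptions: the attention module is reduced to a hypothetical single-scale channel attention (global average pooling, one linear map `weights`, a sigmoid), whereas the paper proposes a multi-scale module, and the fusion is written as a convex combination of the two inputs. It is meant to convey the idea only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(features, weights):
    """Hypothetical single-scale attention: one weight in (0, 1) per channel."""
    pooled = features.mean(axis=(1, 2))              # (C,) global average pool
    return sigmoid(weights @ pooled)[:, None, None]  # (C, 1, 1) for broadcasting


def attentional_fusion(x, y, weights):
    """Fuse two feature maps of shape (C, H, W) with attention weights."""
    m = channel_attention(x + y, weights)  # attention computed on the initial sum
    return m * x + (1.0 - m) * y           # convex combination per channel


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
y = rng.normal(size=(4, 8, 8))
weights = rng.normal(size=(4, 4))

fused = attentional_fusion(x, y, weights)
```

With all-zero `weights` the attention value is 0.5 for every channel and the fusion degenerates to the average of the two inputs, which makes the relation to simple summation explicit.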
Dataset Sensitive Autotuning of Multi-versioned Code Based on Monotonic Properties — Autotuning in Futhark Best Paper Award

P. Munksgaard, S. L. Breddam, T. Henriksen, F. Gieseke, and C. E. Oancea

TFP21 Proceedings of the 22nd International Symposium on Trends in Functional Programming (TFP) (Best Paper Award) 2021

Functional languages allow rewrite-rule systems that aggressively generate a multitude of semantically-equivalent but differently-optimized code versions. In the context of GPGPU execution, this paper addresses the important question of how to compose these code versions into a single program that (near-)optimally discriminates them across different datasets. Rather than aiming at a general autotuning framework reliant on stochastic search, we argue that in some cases, a more effective solution can be obtained by customizing the tuning strategy for the compiler transformation producing the code versions. We present a simple and highly-composable strategy which requires that the (dynamic) program property used to discriminate between code versions conforms with a certain monotonicity assumption. Assuming the monotonicity assumption holds, our strategy guarantees that if an optimal solution exists it will be found. If an optimal solution doesn’t exist, our strategy produces human tractable and deterministic results that provide insights into what went wrong and how it can be fixed. We apply our tuning strategy to the incremental-flattening transformation supported by the publicly-available Futhark compiler and compare with a previous black-box tuning solution that uses the popular OpenTuner library. We demonstrate the feasibility of our solution on a set of standard datasets of real-world applications and public benchmark suites, such as Rodinia and FinPar. We show that our approach shortens the tuning time by a factor of 6× on average, and more importantly, in five out of eleven cases, it produces programs that are (as high as 10×) faster than the ones produced by the OpenTuner-based technique.
@inproceedings{MunksgaardBHGO2021DatasetSensitive, author = {Munksgaard, Philip and Breddam, Svend Lund and Henriksen, Troels and Gieseke, Fabian and Oancea, Cosmin Eugen}, title = {Dataset Sensitive Autotuning of Multi-versioned Code Based on Monotonic Properties --- Autotuning in Futhark}, booktitle = {Proceedings of the 22nd International Symposium on Trends in Functional Programming (TFP)}, pages = {3--23}, series = {Lecture Notes in Computer Science}, volume = {12834}, year = {2021}, publisher = {Springer}, address = {Virtual Event}, doi = {10.1007/978-3-030-83978-9}, tags = {de}, }
Estimating Forest Canopy Height With Multi-Spectral and Multi-Temporal Imagery Using Deep Learning

S. Oehmcke, T. Nyegaard-Signori, K. Grogan, and F. Gieseke

IEEE BIGDATA 2021 2021 IEEE International Conference on Big Data (Big Data) 2021

Canopy height is a vital indicator to assess carbon uptake and productivity of forests. However, precise measurements, such as from airborne or spaceborne 3D laser scanning (LiDAR), are expensive and usually cover only small areas. In this work, we propose a novel deep learning model that can generate detailed maps of tree canopy heights. In contrast to previous approaches that use a single image as input, we process multi-temporal data via an adaptation of the popular U-Net architecture that is based on the EfficientNet and 3D convolution operators. To that end, our model receives multi-spectral Landsat satellite imagery as input and can predict continuous height maps. As labeled data, we resort to spatially sparse LiDAR data from ICESat-2. Thus, with such a model, one can produce dense canopy height maps given only multi-spectral Landsat data. Our experimental evaluation shows that our model outperforms existing and improved single-temporal models. To test generalizability, we created a non-overlapping dataset to evaluate our approach and further tested the model performance on out-of-distribution data. The results show that our model can successfully learn drastic changes in distribution.
@inproceedings{OehmckeNSGG2021Estimating, author = {Oehmcke, Stefan and Nyegaard-Signori, Thomas and Grogan, Kenneth and Gieseke, Fabian}, title = {Estimating Forest Canopy Height With Multi-Spectral and Multi-Temporal Imagery Using Deep Learning}, booktitle = {2021 {IEEE} International Conference on Big Data (Big Data)}, pages = {4915--4924}, editor = {Chen, Yixin and Ludwig, Heiko and Tu, Yicheng and Fayyad, Usama M. and Zhu, Xingquan and Hu, Xiaohua and Byna, Suren and Liu, Xiong and Zhang, Jianping and Pan, Shirui and Papalexakis, Vagelis and Wang, Jianwu and Cuzzocrea, Alfredo and Ordonez, Carlos}, year = {2021}, publisher = {Wiley-IEEE Press}, address = {Orlando, US}, doi = {10.1109/BigData52589.2021.9672018}, tags = {application,rs}, }
Spatial and temporal deep learning methods for deriving land-use following deforestation: A pan-tropical case study using Landsat time series

R. N. Masolele, V. De Sy, M. Herold, D. Marcos, J. Verbesselt, F. Gieseke, A. G. Mullissa, and C. Martius

JOURNAL Remote Sensing of Environment 2021

Assessing land-use following deforestation is vital for reducing emissions from deforestation and forest degradation. In this paper, for the first time, we assess the potential of spatial, temporal and spatio-temporal deep learning methods for large-scale classification of land-use following tropical deforestation using dense satellite time series over six years on the pan-tropical scale (incl. Latin America, Africa, and Asia). Based on an extensive reference database of six forest to land-use conversion types, we find that the spatio-temporal models achieved substantially higher F1-scores than models that account only for spatial or temporal patterns. Although all models performed better when the scope of the problem was limited to a single continent, the spatial models were more competitive than the temporal ones in this setting. These results suggest that the spatial patterns of land-use within a continent share more commonalities than the temporal patterns and the spatial patterns across continents. This work explores the feasibility of extending and complementing previous efforts for characterizing follow-up land-use after deforestation at a small scale via human visual interpretation of high-resolution RGB imagery. It supports the usage of fast and automated large-scale land-use classification and showcases the value of deep learning methods combined with spatio-temporal satellite data to effectively address the complex tasks of identifying land-use following deforestation in a scalable and cost-effective manner.
@article{MasoleleDHMVGMM2021SpatialAnd, author = {Masolele, Robert N. and {De Sy}, Veronique and Herold, Martin and Marcos, Diego and Verbesselt, Jan and Gieseke, Fabian and Mullissa, Adugna G. and Martius, Christopher}, title = {Spatial and temporal deep learning methods for deriving land-use following deforestation: A pan-tropical case study using Landsat time series}, journal = {Remote Sensing of Environment}, year = {2021}, volume = {264}, pages = {112600}, doi = {10.1016/j.rse.2021.112600}, tags = {application,rs}, }
High-level parallel ant colony optimization with algorithmic skeletons

B. Menezes, N. Herrmann, H. Kuchen, and F. Lima Neto

International Journal of Parallel Programming 2021

Parallel implementations of swarm intelligence algorithms such as the ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with similar execution times when compared to low-level implementations.
Attention as Activation

Y. Dai, S. Oehmcke, F. Gieseke, Y. Wu, and K. Barnard

ICPR20 Proceedings of the 25th International Conference on Pattern Recognition (ICPR) 2020

@inproceedings{DaiOGWB2020AttentionAs,
  author = {Dai, Yimian and Oehmcke, Stefan and Gieseke, Fabian and Wu, Yiquan and Barnard, Kobus},
  title = {Attention as Activation},
  booktitle = {Proceedings of the 25th International Conference on Pattern Recognition (ICPR)},
  pages = {9156--9163},
  year = {2020},
  publisher = {IEEE},
  address = {Milan, Italy},
  doi = {10.1109/ICPR48806.2021.9413020},
  tags = {ml},
}

Massively-Parallel Change Detection for Satellite Time Series Data with Missing Values

F. Gieseke, S. Rosca, T. Henriksen, J. Verbesselt, and C. E. Oancea

ICDE20 Proceedings of the 36th IEEE International Conference on Data Engineering (ICDE) 2020

Large amounts of satellite data are now becoming available, which, in combination with appropriate change detection methods, offer the opportunity to derive accurate information on timing and location of disturbances such as deforestation events across the earth surface. Typical scenarios require the analysis of billions of image patches/pixels. While various change detection techniques have been proposed in the literature, the associated implementations usually do not scale well, which renders the corresponding analyses computationally very expensive or even impossible. In this work, we propose a novel massively-parallel implementation for a state-of-the-art change detection method and demonstrate its potential in the context of monitoring deforestation. The novel implementation can handle large scenarios in a few hours or days using cheap commodity hardware, compared to weeks or even years using the existing publicly available code, and enables researchers, for the first time, to conduct global-scale analyses covering large parts of our Earth using little computational resources. From a technical perspective, we provide a high-level parallel algorithm specification along with several performance-critical optimizations dedicated to efficiently map the specified parallelism to modern parallel devices. While a particular change detection method is addressed in this work, the algorithmic building blocks provided are also of immediate relevance to a wide variety of related approaches in remote sensing and other fields.
@inproceedings{GiesekeRHVO2020MassivelyParallel, author = {Gieseke, Fabian and Rosca, Sabina and Henriksen, Troels and Verbesselt, Jan and Oancea, Cosmin Eugen}, title = {Massively-Parallel Change Detection for Satellite Time Series Data with Missing Values}, booktitle = {Proceedings of the 36th {IEEE} International Conference on Data Engineering (ICDE)}, pages = {385--396}, year = {2020}, address = {Dallas, USA}, doi = {10.1109/ICDE48307.2020.00040}, tags = {de,rs}, }
Approximate Nearest-Neighbour Fields via Massively-Parallel Propagation-Assisted K-D Trees

C. E. Oancea, T. Robroek, and F. Gieseke

IEEE BIGDATA 2020 2020 IEEE International Conference on Big Data 2020

Nearest neighbour fields accurately and intuitively describe the transformation between two images and have been heavily used in computer vision. Generating such fields, however, is not an easy task due to the induced computational complexity, which quickly grows with the sizes of the images. Modern parallel devices such as graphics processing units depict a viable way of reducing the practical run time of such compute-intensive tasks. In this work, we propose a novel parallel implementation for one of the state-of-the-art methods for the computation of nearest neighbour fields, called propagation-assisted k-d trees. The resulting implementation yields valuable computational savings over a corresponding multi-core implementation. Additionally, it is tuned to consume only little additional memory and is, hence, capable of dealing with high-resolution image data, which is vital as image quality standards keep rising.
@inproceedings{OanceaRG2020Approximate, author = {Oancea, Cosmin Eugen and Robroek, Ties and Gieseke, Fabian}, title = {Approximate Nearest-Neighbour Fields via Massively-Parallel Propagation-Assisted K-D Trees}, booktitle = {2020 {IEEE} International Conference on Big Data}, pages = {5172--5181}, year = {2020}, publisher = {IEEE}, doi = {10.1109/BigData50022.2020.9378426}, tags = {de,ml}, }
Creating Cloud-Free Satellite Imagery from Image Time Series with Deep Learning

S. Oehmcke, T. K. Chen, A. V. Prishchepov, and F. Gieseke

BIGSPATIAL20 Proceedings of the 9th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data 2020

Optical satellite images are important for environmental monitoring. Unfortunately, such images are often affected by distortions, such as clouds, shadows, or missing data. This work proposes a deep learning approach for cleaning and imputing satellite images, which could serve as a reliable preprocessing step for spatial and spatio-temporal analyses. More specifically, a coherent and cloud-free image for a specific target date and region is created based on a sequence of images of that region obtained at previous dates. Our model first extracts information from the previous time steps via a special gating function and then resorts to a modified version of the well-known U-Net architecture to obtain the desired output image. The model uses supplementary data, namely the approximate cloud coverage of input images, the temporal distance to the target time, and a missing data mask for each input time step. During the training phase we condition our model with the target's cloud coverage and missing values (disabled in production), which allows us to use data afflicted by distortion during training and thus does not require pre-selection of distortion-free data. Our experimental evaluation, conducted on data of the Landsat missions, shows that our approach outperforms the commonly utilized approach that resorts to taking the median of cloud-free pixels for a given position. This is especially the case when the quality of the data for the considered period is poor (e.g., lack of cloud-free images during the winter/fall periods). Our deep learning approach makes it possible to improve the utility of the entire Landsat archive, the only existing global medium-resolution free-access satellite archive dating back to the 1970s. It therefore holds scientific and societal potential for future analyses conducted on data from this and other satellite imagery repositories.
@inproceedings{OehmckeTHPG2020CreatingCloudFree, author = {Oehmcke, Stefan and Chen, Tzu-Hsin Karen and Prishchepov, Alexander V. and Gieseke, Fabian}, title = {Creating Cloud-Free Satellite Imagery from Image Time Series with Deep Learning}, booktitle = {Proceedings of the 9th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data}, pages = {3:1--3:10}, year = {2020}, publisher = {ACM}, address = {Seattle, USA}, doi = {10.1145/3423336.3429345}, tags = {ml,de,rs,application}, }
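The baseline named in the abstract above, the per-position median of cloud-free pixels, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `median_composite`, the nested-list image layout, and the convention that `True` in a mask marks a cloudy or missing pixel are assumptions made for this example.

```python
from statistics import median
from typing import List, Optional

Image = List[List[float]]  # one band, indexed as image[row][col]
Mask = List[List[bool]]    # True marks a cloudy or missing pixel (assumed convention)


def median_composite(images: List[Image], masks: List[Mask]) -> List[List[Optional[float]]]:
    """Per-pixel median over time, using only observations not flagged in the masks."""
    rows, cols = len(images[0]), len(images[0][0])
    composite: List[List[Optional[float]]] = []
    for r in range(rows):
        out_row: List[Optional[float]] = []
        for c in range(cols):
            clear = [img[r][c] for img, m in zip(images, masks) if not m[r][c]]
            # No clear observation at this position: leave a gap (None).
            out_row.append(median(clear) if clear else None)
        composite.append(out_row)
    return composite


# Three 1x2 images (hypothetical values). The first pixel is cloudy at the
# second time step; the second pixel is never observed clearly.
images = [[[0.2, 0.9]], [[0.8, 0.9]], [[0.4, 0.9]]]
masks = [[[False, True]], [[True, True]], [[False, True]]]
composite = median_composite(images, masks)
```

Positions without any clear observation stay empty in this sketch, which hints at why such a baseline degrades in periods with few cloud-free images.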
An unexpectedly large count of trees in the West African Sahara and Sahel

M. Brandt, C. Tucker, A. Kariryaa, K. Rasmussen, C. Abel, J. Small, J. Chave, L. Rasmussen, P. Hiernaux, A. Diouf, L. Kergoat, O. Mertz, C. Igel, F. Gieseke, J. Schöning, S. Li, K. Melocik, J. Meyer, S. Sinno, E. Romero, E. Glennie, A. Montagu, M. Dendoncker, and R. Fensholt

JOURNAL Nature 2020

A large proportion of dryland trees and shrubs (hereafter referred to collectively as trees) grow in isolation, without canopy closure. These non-forest trees have a crucial role in biodiversity, and provide ecosystem services such as carbon storage, food resources and shelter for humans and animals. However, most public interest relating to trees is devoted to forests, and trees outside of forests are not well-documented. Here we map the crown size of each tree more than 3 m² in size over a land area that spans 1.3 million km² in the West African Sahara, Sahel and sub-humid zone, using submetre-resolution satellite imagery and deep learning. We detected over 1.8 billion individual trees (13.4 trees per hectare), with a median crown size of 12 m², along a rainfall gradient from 0 to 1,000 mm per year. The canopy cover increases from 0.1% (0.7 trees per hectare) in hyper-arid areas, through 1.6% (9.9 trees per hectare) in arid and 5.6% (30.1 trees per hectare) in semi-arid zones, to 13.3% (47 trees per hectare) in sub-humid areas. Although the overall canopy cover is low, the relatively high density of isolated trees challenges prevailing narratives about dryland desertification, and even the desert shows a surprisingly high tree density. Our assessment suggests a way to monitor trees outside of forests globally, and to explore their role in mitigating degradation, climate change and poverty.
@article{BrandtTKRASCRHDKMIGSLMMSRGMDF2020AnUnexpectedly, author = {Brandt, M and Tucker, C and Kariryaa, A and Rasmussen, K and Abel, C and Small, J and Chave, J and Rasmussen, L and Hiernaux, P and Diouf, A and Kergoat, L and Mertz, O and Igel, C and Gieseke, F and Schöning, J and Li, S and Melocik, K and Meyer, J and Sinno, S and Romero, E and Glennie, E and Montagu, A and Dendoncker, M and Fensholt, R}, title = {An unexpectedly large count of trees in the West African Sahara and Sahel}, journal = {Nature}, year = {2020}, volume = {587}, pages = {78--82}, doi = {10.1038/s41586-020-2824-5}, tags = {application,rs,ml} }
Implementation of BFASTmonitor Algorithm on Google Earth Engine to Support Large-Area and Sub-Annual Change Monitoring Using Earth Observation Data

E. Hamunyela, S. Rosca, A. Mirt, E. Engle, M. Herold, F. Gieseke, and J. Verbesselt

JOURNAL Remote Sensing 2020

Monitoring of abnormal changes on the earth’s surface (e.g., forest disturbance) has improved greatly in recent years because of satellite remote sensing. However, high computational costs inherently associated with processing and analysis of satellite data often inhibit large-area and sub-annual monitoring. Normal seasonal variations also complicate the detection of abnormal changes at sub-annual scale in the time series of satellite data. Recently, however, computationally powerful platforms, such as the Google Earth Engine (GEE), have been launched to support large-area analysis of satellite data. Change detection methods with the capability to detect abnormal changes in time series data while accounting for normal seasonal variations have also been developed but are computationally intensive. Here, we report an implementation of BFASTmonitor (Breaks For Additive Season and Trend monitor) on GEE to support large-area and sub-annual change monitoring using satellite data available in GEE. BFASTmonitor is a data-driven unsupervised change monitoring approach that detects abnormal changes in time series data, with near real-time monitoring capabilities. Although BFASTmonitor has been widely used in forest cover loss monitoring, it is a generic change monitoring approach that can be used to monitor changes in various time series data. Using Landsat time series for normalised difference moisture index (NDMI), we evaluated the performance of our GEE BFASTmonitor implementation (GEE BFASTmonitor) by detecting forest disturbance at three forest areas (humid tropical forest, dry tropical forest, and miombo woodland) while comparing it to the original R-based BFASTmonitor implementation (original BFASTmonitor). A map-to-map comparison showed that the spatial and temporal agreements on forest disturbance between the original and our GEE BFASTmonitor implementations were high. At each site, the spatial agreement was more than 97%, whereas the temporal agreement was over 94%. The high spatial and temporal agreement shows that we have properly translated and implemented the BFASTmonitor algorithm on GEE. Naturally, due to different numerical solvers being used for regression model fitting in R and GEE, small differences could be observed in the outputs. These differences were most noticeable at the dry tropical forest and miombo woodland sites, where the forest exhibits strong seasonality. To make GEE BFASTmonitor accessible to non-technical users, we developed a web application with a simplified user interface. We also created a JavaScript-based GEE BFASTmonitor package that can be imported as a module. Overall, our GEE BFASTmonitor implementation fills an important gap in large-area environmental change monitoring using earth observation data.
@article{HamunyelaRMEHGV2020Implementation, author = {Hamunyela, Eliakim and Rosca, Sabina and Mirt, Andrei and Engle, Eric and Herold, Martin and Gieseke, Fabian and Verbesselt, Jan}, title = {Implementation of BFASTmonitor Algorithm on Google Earth Engine to Support Large-Area and Sub-Annual Change Monitoring Using Earth Observation Data}, journal = {Remote Sensing}, year = {2020}, volume = {12}, number = {18}, doi = {10.3390/rs12182953}, tags = {application,rs} }
Magnitude and Uncertainty Pruning Criterion for Neural Networks

V. Ko, S. Oehmcke, and F. Gieseke

IEEE BIGDATA 2019 2019 IEEE International Conference on Big Data (IEEE BigData) 2019

Neural networks have achieved dramatic improvements in recent years and depict the state-of-the-art methods for many real-world tasks nowadays. One drawback is, however, that many of these models are overparameterized, which makes them both computationally and memory intensive. Furthermore, overparameterization can also lead to undesired overfitting side-effects. Inspired by recently proposed magnitude-based pruning schemes and the Wald test from the field of statistics, we introduce a novel magnitude and uncertainty (M&U) pruning criterion that helps to lessen such shortcomings. One important advantage of our M&U pruning criterion is that it is scale-invariant, whereas the magnitude-based pruning criterion is sensitive to the scale of the weights. In addition, we present a “pseudo bootstrap” scheme, which can efficiently estimate the uncertainty of the weights by using their update information during training. Our experimental evaluation, which is based on various neural network architectures and datasets, shows that our new criterion leads to more compressed models compared to models that are solely based on magnitude-based pruning criteria, with, at the same time, less loss in predictive power.
@inproceedings{KoOG2019MagnitudeAnd, author = {Ko, Vinnie and Oehmcke, Stefan and Gieseke, Fabian}, title = {Magnitude and Uncertainty Pruning Criterion for Neural Networks}, booktitle = {2019 {IEEE} International Conference on Big Data {(IEEE} BigData)}, pages = {2317--2326}, year = {2019}, publisher = {IEEE}, address = {Los Angeles, USA}, doi = {10.1109/BigData47090.2019.9005692}, url = {https://doi.org/10.1109/BigData47090.2019.9005692}, tags = {ml}, }
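The scale argument in the abstract above can be illustrated with a Wald-style score, that is, an estimate divided by its standard error. The sketch below is an illustration under stated assumptions and not the paper's exact criterion: the function names `magnitude_score` and `mu_score`, the use of a recorded per-weight history as a stand-in for the "pseudo bootstrap", and all example values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def magnitude_score(history: Sequence[float]) -> float:
    """Plain magnitude criterion: absolute value of the latest weight."""
    return abs(history[-1])


def mu_score(history: Sequence[float], eps: float = 1e-12) -> float:
    """Wald-style score: |mean| divided by the sample standard deviation of the
    recorded values (a hypothetical stand-in for the uncertainty estimate)."""
    return abs(mean(history)) / (stdev(history) + eps)


# Hypothetical values of two weights recorded over five training steps.
stable_small = [0.010, 0.011, 0.010, 0.009, 0.010]  # small but consistent
noisy_large = [0.50, -0.30, 0.40, -0.45, 0.20]       # larger but erratic

# Rescaling a weight changes its magnitude score by the same factor,
# while the Wald-style score stays essentially the same.
rescaled = [100.0 * w for w in stable_small]
```

With these values the magnitude criterion would prune `stable_small` first, whereas the ratio-based score ranks it above `noisy_large`.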
Detecting Hardly Visible Roads in Low-Resolution Satellite Time Series Data

S. Oehmcke, C. Thrysøe, A. Borgstad, M. V. Salles, M. Brandt, and F. Gieseke

IEEE BIGDATA 2019 2019 IEEE International Conference on Big Data (IEEE BigData) 2019

Massive amounts of satellite data have been gathered over time, holding the potential to unveil a spatiotemporal chronicle of the surface of Earth. These data allow scientists to investigate various important issues, such as land use changes, on a global scale. However, not all land-use phenomena are equally visible on satellite imagery. In particular, the creation of an inventory of the planet’s road infrastructure remains a challenge, despite being crucial to analyze urbanization patterns and their impact. Towards this end, this work advances data-driven approaches for the automatic identification of roads based on open satellite data. Given the typical resolutions of these historical satellite data, we observe that there is inherent variation in the visibility of different road types. Based on this observation, we propose two deep learning frameworks that extend state-of-the-art deep learning methods by formalizing road detection as an ordinal classification task. In contrast to related schemes, one of the two models also resorts to satellite time series data that are potentially affected by missing data and cloud occlusion. Taking these time series data into account eliminates the need to manually curate datasets of high-quality image tiles, substantially simplifying the application of such models on a global scale. We evaluate our approaches on a dataset that is based on Sentinel 2 satellite imagery and OpenStreetMap vector data. Our results indicate that the proposed models can successfully identify large and medium-sized roads. We also discuss opportunities and challenges related to the detection of roads and other infrastructure on a global scale.
@inproceedings{OehmckeTBMBG2019DetectingHardly, author = {Oehmcke, Stefan and Thrysøe, Christoph and Borgstad, Andreas and Salles, Marcos Vaz and Brandt, Martin and Gieseke, Fabian}, title = {Detecting Hardly Visible Roads in Low-Resolution Satellite Time Series Data}, booktitle = {2019 {IEEE} International Conference on Big Data {(IEEE} BigData)}, pages = {2403--2412}, year = {2019}, publisher = {IEEE}, doi = {10.1109/BigData47090.2019.9006251}, tags = {ml,application,rs}, }
Training Big Random Forests with Little Resources

F. Gieseke and C. Igel

KDD18 Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018 2018

Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows one to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees’ leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
@inproceedings{GiesekeI2018TrainingBig, author = {Gieseke, Fabian and Igel, Christian}, title = {Training Big Random Forests with Little Resources}, booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining, KDD 2018, London, UK, August 19-23, 2018}, pages = {1445--1454}, year = {2018}, publisher = {ACM}, doi = {10.1145/3219819.3220124}, tags = {ml,de}, }
Bigger Buffer k-d Trees on Multi-Many-Core Systems

F. Gieseke, C. E. Oancea, A. Mahabal, C. Igel, and T. Heskes

High Performance Computing for Computational Science — VECPAR 2018 — 13th International Conference, São Pedro, Brazil, September 17-19, 2018, Revised Selected Papers 2018

A buffer k-d tree is a k-d tree variant for massively-parallel nearest neighbor search. While providing valuable speed-ups on modern many-core devices in case both a large number of reference and query points are given, buffer k-d trees are limited by the number of points that can fit on a single device. In this work, we show how to modify the original data structure and the associated workflow to make the overall approach capable of dealing with massive data sets. We further provide a simple yet efficient way of using multiple devices given in a single workstation. The applicability of the modified framework is demonstrated in the context of astronomy, a field that is faced with huge amounts of data.
@inproceedings{GiesekeOMIH2018BiggerBuffer, author = {Gieseke, Fabian and Oancea, Cosmin Eugen and Mahabal, Ashish and Igel, Christian and Heskes, Tom}, title = {Bigger Buffer k-d Trees on Multi-Many-Core Systems}, booktitle = {High Performance Computing for Computational Science --- VECPAR 2018 --- 13th International Conference, São Pedro, Brazil, September 17-19, 2018, Revised Selected Papers}, pages = {202--214}, series = {Lecture Notes in Computer Science}, volume = {11333}, year = {2018}, publisher = {Springer}, doi = {10.1007/978-3-030-15996-2\_15}, tags = {de}, }
Massively-parallel break detection for satellite data

M. Mehren, F. Gieseke, J. Verbesselt, S. Rosca, S. Horion, and A. Zeileis

Proceedings of the 30th International Conference on Scientific and Statistical Database Management, SSDBM 2018, Bozen-Bolzano, Italy, July 09-11, 2018 2018

The field of remote sensing is nowadays faced with huge amounts of data. While this offers a variety of exciting research opportunities, it also yields significant challenges regarding both computation time and space requirements. In practice, the sheer data volumes render existing approaches too slow for processing and analyzing all the available data. This work aims at accelerating BFAST, one of the state-of-the-art methods for break detection given satellite image time series. In particular, we propose a massively-parallel implementation for BFAST that can effectively make use of modern parallel compute devices such as GPUs. Our experimental evaluation shows that the proposed GPU implementation is up to four orders of magnitude faster than the existing publicly available implementation and up to ten times faster than a corresponding multi-threaded CPU execution. The dramatic decrease in running time renders the analysis of significantly larger datasets possible in seconds or minutes instead of hours or days. We demonstrate the practical benefits of our implementations given both artificial and real datasets.
@inproceedings{MehrenGVRHZ2018MassivelyParallel, author = {Mehren, Malte and Gieseke, Fabian and Verbesselt, Jan and Rosca, Sabina and Horion, Stefanie and Zeileis, Achim}, title = {Massively-parallel break detection for satellite data}, booktitle = {Proceedings of the 30th International Conference on Scientific and Statistical Database Management, SSDBM 2018, Bozen-Bolzano, Italy, July 09-11, 2018}, pages = {5:1--5:10}, year = {2018}, publisher = {ACM}, doi = {10.1145/3221269.3223032}, tags = {de}, }
Return of the features — Efficient feature selection and interpretation for photometric redshifts

A. D’Isanto, S. Cavuoti, F. Gieseke, and K. L. Polsterer

Astronomy & Astrophysics 2018

The explosion of data in recent years has generated an increasing need for new analysis techniques in order to extract knowledge from massive datasets. Machine learning has proved particularly useful to perform this task. Fully automatized methods have recently gathered great popularity, even though those methods often lack physical interpretability. In contrast, feature based approaches can provide both well-performing models and understandable causalities with respect to the correlations found between features and physical processes. Efficient feature selection is an essential tool to boost the performance of machine learning models. In this work, we propose a forward selection method in order to compute, evaluate, and characterize better performing features for regression and classification problems. Given the importance of photometric redshift estimation, we adopt it as our case study. We synthetically created 4,520 features by combining magnitudes, errors, radii, and ellipticities of quasars, taken from the SDSS. We apply a forward selection process, a recursive method in which a huge number of feature sets is tested through a kNN algorithm, leading to a tree of feature sets. The branches of the tree are then used to perform experiments with the random forest, in order to validate the best set with an alternative model. We demonstrate that the sets of features determined with our approach improve the performances of the regression models significantly when compared to the performance of the classic features from the literature. The found features are unexpected and surprising, being very different from the classic features. Therefore, a method to interpret some of the found features in a physical context is presented. The methodology described here is very general and can be used to improve the performance of machine learning models for any regression or classification task.
@article{DIsantoCGP2018ReturnOfThe, author = {D'Isanto, Antonio and Cavuoti, Stefano and Gieseke, Fabian and Polsterer, Kai Lars}, title = {Return of the features --- Efficient feature selection and interpretation for photometric redshifts}, journal = {Astronomy \& Astrophysics}, year = {2018}, volume = {616}, pages = {A97}, doi = {10.1051/0004-6361/201833103}, tags = {application}, }
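Forward selection with a kNN model, as outlined in the abstract above, can be sketched in a simplified form. The version below is a greedy, single-branch variant scored by leave-one-out 1-NN regression error; it does not reproduce the tree of feature sets described in the paper, and the function names `loo_knn_error` and `forward_selection` as well as the toy data are assumptions for illustration.

```python
from typing import List, Sequence


def loo_knn_error(X: Sequence[Sequence[float]], y: Sequence[float],
                  features: Sequence[int], k: int = 1) -> float:
    """Leave-one-out mean squared error of k-nearest-neighbour regression
    restricted to the given feature indices."""
    n = len(X)
    total = 0.0
    for i in range(n):
        # Squared Euclidean distances to all other points; ties are broken by index.
        dists = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in features), j)
            for j in range(n) if j != i
        )
        pred = sum(y[j] for _, j in dists[:k]) / k
        total += (y[i] - pred) ** 2
    return total / n


def forward_selection(X: Sequence[Sequence[float]], y: Sequence[float],
                      n_select: int, k: int = 1) -> List[int]:
    """Greedily add the feature that yields the lowest leave-one-out error."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    for _ in range(n_select):
        best = min(remaining, key=lambda f: loo_knn_error(X, y, selected + [f], k))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target equals feature 1; features 0 and 2 are unrelated.
X = [[5, 0.0, 3], [1, 1.0, 9], [4, 2.0, 1], [2, 3.0, 7], [3, 4.0, 2], [0, 5.0, 8]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
selected = forward_selection(X, y, n_select=2)
```

On this toy data the informative feature (index 1) is picked first. Keeping several candidate sets per step instead of only the best one would lead towards the branching search the abstract describes.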
Artistic movement recognition by consensus of boosted SVM based experts

C. Florea and F. Gieseke

Journal of Visual Communication and Image Representation 2018

In this work we aim to automatically recognize the artistic movement from a digitized image of a painting. Our approach uses a new system that resorts to descriptions induced by color structure histograms and by novel topographical features for texture assessment. The topographical descriptors accumulate information from the first and second local derivatives within four layers of finer representations. The classification is performed by two layers of ensembles. The first is an adapted boosted ensemble of support vector machines, which introduces further randomization over feature categories as a regularization. The training of the ensemble yields individual experts by isolating initially misclassified images and by correcting them in further stages of the process. The solution improves the performance by a second layer built upon the consensus of multiple local experts that analyze different parts of the images. The resulting performance compares favorably with classical solutions and manages to match the ones of modern deep learning frameworks.
@article{FloreaG2018ArtisticMovement, author = {Florea, Corneliu and Gieseke, Fabian}, title = {Artistic movement recognition by consensus of boosted SVM based experts}, journal = {Journal of Visual Communication and Image Representation}, year = {2018}, volume = {56}, pages = {220--233}, doi = {10.1016/j.jvcir.2018.09.015}, tags = {application}, }
Convolutional Neural Networks for Transient Candidate Vetting in Large-Scale Surveys

F. Gieseke, S. Bloemen, C. van den Bogaard, T. Heskes, J. Kindler, R. A. Scalzo, V. A. Ribeiro, J. van Roestel, P. J. Groot, F. Yuan, A. Möller, and B. E. Tucker

JOURNAL Monthly Notices of the Royal Astronomical Society (MNRAS) 2017

Current synoptic sky surveys monitor large areas of the sky to find variable and transient astronomical sources. As the number of detections per night at a single telescope easily exceeds several thousand, current detection pipelines make intensive use of machine learning algorithms to classify the detected objects and to filter out the most interesting candidates. A number of upcoming surveys will produce up to three orders of magnitude more data, which renders high-precision classification systems essential to reduce the manual and, hence, expensive vetting by human experts. We present an approach based on convolutional neural networks to discriminate between true astrophysical sources and artefacts in reference-subtracted optical images. We show that relatively simple networks are already competitive with state-of-the-art systems and that their quality can further be improved via slightly deeper networks and additional pre-processing steps – eventually yielding models outperforming state-of-the-art systems. In particular, our best model correctly classifies about 97.3 per cent of all ‘real’ and 99.7 per cent of all ‘bogus’ instances on a test set containing 1942 ‘bogus’ and 227 ‘real’ instances in total. Furthermore, the networks considered in this work can also successfully classify these objects at hand without relying on difference images, which might pave the way for future detection pipelines not containing image subtraction steps at all.
@article{GiesekeEtAl2017, author = {Gieseke, Fabian and Bloemen, Steven and van den Bogaard, Cas and Heskes, Tom and Kindler, Jonas and Scalzo, Richard A. and Ribeiro, Valerio A.R.M. and van Roestel, Jan and Groot, Paul J. and Yuan, Fang and Möller, Anais and Tucker, Brad E.}, title = {Convolutional Neural Networks for Transient Candidate Vetting in Large-Scale Surveys}, journal = {Monthly Notices of the Royal Astronomical Society {(MNRAS)}}, year = {2017}, pages = {3101--3114}, volume = {472}, number = {3}, publisher = {Oxford University Press}, tags = {application,ml}, }
A probabilistic approach to emission-line galaxy classification

R. S. de Souza, M. L. L. Dantas, M. V. Costa-Duarte, E. D. Feigelson, M. Killedar, P.-Y. Lablanche, R. Vilalta, A. Krone-Martins, R. Beck, and F. Gieseke

Monthly Notices of the Royal Astronomical Society (MNRAS) 2017

We invoke a Gaussian mixture model (GMM) to jointly analyse two traditional emission-line classification schemes of galaxy ionization sources: the Baldwin-Phillips-Terlevich (BPT) and $W_{\mathrm{H}\alpha}$ vs. [NII]/H$\alpha$ (WHAN) diagrams, using spectroscopic data from the Sloan Digital Sky Survey Data Release 7 and SEAGal/STARLIGHT datasets. We apply a GMM to empirically define classes of galaxies in a three-dimensional space spanned by the [OIII]/H$\beta$, [NII]/H$\alpha$, and EW(H$\alpha$) optical parameters. The best-fit GMM based on several statistical criteria suggests a solution around four Gaussian components (GCs), which are capable of explaining up to 97 per cent of the data variance. Using elements of information theory, we compare each GC to their respective astronomical counterpart. GC1 and GC4 are associated with star-forming galaxies, suggesting the need to define a new starburst subgroup. GC2 is associated with BPT’s Active Galaxy Nuclei (AGN) class and WHAN’s weak AGN class. GC3 is associated with BPT’s composite class and WHAN’s strong AGN class. Conversely, there is no statistical evidence – based on four GCs – for the existence of a Seyfert/LINER dichotomy in our sample. Notwithstanding, the inclusion of an additional GC5 unravels it. The GC5 appears associated to the LINER and Passive galaxies on the BPT and WHAN diagrams respectively. Subtleties aside, we demonstrate the potential of our methodology to recover/unravel different objects inside the wilderness of astronomical datasets, without lacking the ability to convey physically interpretable results. The probabilistic classifications from the GMM analysis are publicly available within the COINtoolbox.
@article{SouzaEtAl2017, author = {de Souza, R. S. and Dantas, M. L. L. and Costa-Duarte, M. V. and Feigelson, E. D. and Killedar, M. and Lablanche, P.-Y. and Vilalta, R. and Krone-Martins, A. and Beck, R. and Gieseke, Fabian}, title = {A probabilistic approach to emission-line galaxy classification}, journal = {Monthly Notices of the Royal Astronomical Society {(MNRAS)}}, year = {2017}, publisher = {Oxford University Press}, tags = {application} }
Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy

J. Kremer, K. Stensbo-Smidt, F. Gieseke, K. S. Pedersen, and C. Igel

IEEE Intelligent Systems 2017

Astrophysics and cosmology are rich with data. The advent of wide-area digital cameras on large aperture telescopes has led to ever more ambitious surveys of the sky. Data volumes of entire surveys a decade ago can now be acquired in a single night and real-time analysis is often desired. Thus, modern astronomy requires big data know-how, in particular it demands highly efficient machine learning and image analysis algorithms. But scalability is not the only challenge: Astronomy applications touch several current machine learning research questions, such as learning from biased data and dealing with label and measurement noise. We argue that this makes astronomy a great domain for computer science research, as it pushes the boundaries of data analysis. In the following, we will present this exciting application area for data scientists. We will focus on exemplary results, discuss main challenges, and highlight some recent methodological advancements in machine learning and image analysis triggered by astronomical applications.
@article{KremerSGPI17, author = {Kremer, Jan and Stensbo{-}Smidt, Kristoffer and Gieseke, Fabian and Pedersen, Kim Steenstrup and Igel, Christian}, title = {Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy}, journal = {{IEEE} Intelligent Systems}, volume = {32}, number = {2}, pages = {16--22}, year = {2017}, doi = {10.1109/MIS.2017.40}, tags = {application}, }
bufferkdtree: A Python library for massive nearest neighbor queries on multi-many-core devices

F. Gieseke, C. E. Oancea, and C. Igel

Knowledge Based Systems 2017

The bufferkdtree package is an open-source software that provides an efficient implementation for processing huge amounts of nearest neighbor queries in Euclidean spaces of moderate dimensionality. Its underlying implementation resorts to a variant of the classical k-d tree data structure, called buffer k-d tree, which can be used to efficiently perform bulk nearest neighbor searches on modern many-core devices. The package, which is based on Python, C, and OpenCL, is made publicly available online at https://github.com/gieseke/bufferkdtree under the GPLv2 license.
@article{GiesekeOI17, author = {Gieseke, Fabian and Oancea, Cosmin E. and Igel, Christian}, title = {bufferkdtree: {A} Python library for massive nearest neighbor queries on multi-many-core devices}, journal = {Knowledge Based Systems}, volume = {120}, pages = {1--3}, year = {2017}, doi = {10.1016/j.knosys.2017.01.002}, tags = {de,hpc}, }
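For orientation, the classical k-d tree that the abstract above refers to can be sketched in a few lines. This is a plain sequential, single-query version, not the buffer k-d tree variant and not the bufferkdtree API; the names `Node`, `build`, `sq_dist`, and `nearest` and the sample points are assumptions for illustration.

```python
from typing import List, Optional, Sequence

Point = Sequence[float]


class Node:
    """One k-d tree node: a splitting point, its split axis, and two subtrees."""

    def __init__(self, point: Point, axis: int,
                 left: Optional["Node"], right: Optional["Node"]) -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split at the median along cycling axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1), build(points[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], query: Point, best: Optional[Point] = None) -> Optional[Point]:
    """Depth-first search that visits the far subtree only if it can hold a closer point."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, query) < sq_dist(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
closest = nearest(tree, (9, 2))
```

Each query here walks the tree on its own; the bulk-query setting described in the abstract processes very many queries at once on many-core devices.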
Massively-parallel best subset selection for ordinary least-squares regression

F. Gieseke, K. L. Polsterer, A. Mahabal, C. Igel, and T. Heskes

2017 IEEE Symposium Series on Computational Intelligence, SSCI 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017 2017

Selecting an optimal subset of k out of d features for linear regression models given n training instances is often considered intractable for feature spaces with hundreds or thousands of dimensions. We propose an efficient massively-parallel implementation for selecting such optimal feature subsets in a brute-force fashion for small k. By exploiting the enormous compute power provided by modern parallel devices such as graphics processing units, it can deal with thousands of input dimensions even using standard commodity hardware only. We evaluate the practical runtime using artificial datasets and sketch the applicability of our framework in the context of astronomy.
@inproceedings{GiesekePMIH17, author = {Gieseke, Fabian and Polsterer, Kai Lars and Mahabal, Ashish and Igel, Christian and Heskes, Tom}, title = {Massively-parallel best subset selection for ordinary least-squares regression}, booktitle = {2017 {IEEE} Symposium Series on Computational Intelligence, {SSCI} 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017}, pages = {1--8}, publisher = {{IEEE}}, year = {2017}, doi = {10.1109/SSCI.2017.8285225}, tags = {de,ml}, }
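The brute-force search described in the abstract above can be illustrated with a small sequential sketch. It is not the authors' massively-parallel GPU implementation; the function name `best_subset`, the added intercept column, and the use of the residual sum of squares as selection criterion are assumptions made for illustration only.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, k):
    """Exhaustively evaluate all k-subsets of the columns of X.

    Returns the column indices (as a sorted tuple) and the residual sum of
    squares of the ordinary least-squares fit with the lowest error.
    Illustrative only: the cost grows with the number of k-subsets of d columns.
    """
    n, d = X.shape
    best_rss, best_idx = np.inf, None
    for idx in combinations(range(d), k):
        # Design matrix: the selected features plus an intercept column (assumption).
        A = np.column_stack([X[:, list(idx)], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = float(np.sum((y - A @ coef) ** 2))
        if rss < best_rss:
            best_rss, best_idx = rss, idx
    return best_idx, best_rss
```

Each subset is evaluated independently of all others, which is presumably what makes the problem amenable to parallel hardware for small k.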
Deep-learnt classification of light curves

A. Mahabal, K. Sheth, F. Gieseke, A. Pai, S. G. Djorgovski, A. J. Drake, and M. J. Graham

2017 IEEE Symposium Series on Computational Intelligence, SSCI 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017 2017

Astronomy light curves are sparse, gappy, and heteroscedastic. As a result, standard time series methods regularly used for financial and similar datasets are of little help, and astronomers are usually left to their own instruments and techniques to classify light curves. A common approach is to derive statistical features from the time series and to use machine learning methods, generally supervised, to separate objects into a few of the standard classes. In this work, we transform the time series to two-dimensional light curve representations in order to classify them using modern deep learning techniques. In particular, we show that classifiers based on convolutional neural networks work well for broad characterization and classification. We use labeled datasets of periodic variables from the CRTS survey and show how this opens doors for a quick classification of diverse classes with several possible exciting extensions.
@inproceedings{MahabalSGPDDG17, author = {Mahabal, Ashish and Sheth, Kshiteej and Gieseke, Fabian and Pai, A. and Djorgovski, S. George and Drake, Andrew J. and Graham, Matthew J.}, title = {Deep-learnt classification of light curves}, booktitle = {2017 {IEEE} Symposium Series on Computational Intelligence, {SSCI} 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017}, pages = {1--8}, publisher = {{IEEE}}, year = {2017}, doi = {10.1109/SSCI.2017.8280984}, tags = {application}, }
Artistic Movement Recognition by Boosted Fusion of Color Structure and Topographic Description

C. Florea, C. Toca, and F. Gieseke

2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017 2017

We address the problem of automatically recognizing artistic movement in digitized paintings. We make the following contributions: Firstly, we introduce a large digitized painting database that contains refined annotations of artistic movement. Secondly, we propose a new system for the automatic categorization that resorts to image descriptions by color structure and novel topographical features as well as to an adapted boosted ensemble of support vector machines. The system manages to isolate initially misclassified images and to correct such errors in further stages of the boosting process. The resulting performance of the system compares favorably with classical solutions in terms of accuracy and even manages to outperform modern deep learning frameworks.
@inproceedings{FloreaTG17, author = {Florea, Corneliu and Toca, Cosmin and Gieseke, Fabian}, title = {Artistic Movement Recognition by Boosted Fusion of Color Structure and Topographic Description}, booktitle = {2017 {IEEE} Winter Conference on Applications of Computer Vision, {WACV} 2017, Santa Rosa, CA, USA, March 24-31, 2017}, pages = {569--577}, publisher = {{IEEE} Computer Society}, year = {2017}, doi = {10.1109/WACV.2017.69}, tags = {application}, }
On the realistic validation of photometric redshifts

R. Beck, C. A. Lin, E. O. Ishida, F. Gieseke, R. S. Souza, M. V. Costa-Duarte, M. W. Hattab, and A. Krone-Martins

Monthly Notices of the Royal Astronomical Society (MNRAS) 2017

Two of the main problems encountered in the development and accurate validation of photometric redshift (photo-z) techniques are the lack of spectroscopic coverage in feature space (e.g. colours and magnitudes) and the mismatch between photometric error distributions associated with the spectroscopic and photometric samples. Although these issues are well known, there is currently no standard benchmark allowing a quantitative analysis of their impact on the final photo-z estimation. In this work, we present two galaxy catalogues, Teddy and Happy, built to enable a more demanding and realistic test of photo-z methods. Using photometry from the Sloan Digital Sky Survey and spectroscopy from a collection of sources, we constructed datasets which mimic the biases between the underlying probability distribution of the real spectroscopic and photometric sample. We demonstrate the potential of these catalogues by submitting them to the scrutiny of different photo-z methods, including machine learning (ML) and template fitting approaches. Beyond the expected bad results from most ML algorithms for cases with missing coverage in feature space, we were able to recognize the superiority of global models in the same situation and the general failure across all types of methods when incomplete coverage is convoluted with the presence of photometric errors - a data situation which photo-z methods were not trained to deal with up to now and which must be addressed by future large scale surveys. Our catalogues represent the first controlled environment allowing a straightforward implementation of such tests.
@article{BeckEtAl2017, author = {Beck, Robert and Lin, Chieh A. and Ishida, Emille O. and Gieseke, Fabian and de Souza, Rafael S. and Costa-Duarte, Marcus V. and Hattab, Mohammed W. and Krone-Martins, Alberto}, title = {On the realistic validation of photometric redshifts}, journal = {Monthly Notices of the Royal Astronomical Society {(MNRAS)}}, year = {2017}, volume = {468}, number = {4}, pages = {4323-4339}, publisher = {Oxford University Press}, tags = {application}, }
Sacrificing information for the greater good: how to select photometric bands for optimal accuracy

K. Stensbo-Smidt, F. Gieseke, A. Zirm, K. S. Pedersen, and C. Igel

Monthly Notices of the Royal Astronomical Society (MNRAS) 2017

Large-scale surveys make huge amounts of photometric data available. Because of the sheer amount of objects, spectral data cannot be obtained for all of them. Therefore it is important to devise techniques for reliably estimating physical properties of objects from photometric information alone. These estimates are needed to automatically identify interesting objects worth a follow-up investigation as well as to produce the required data for a statistical analysis of the space covered by a survey. We argue that machine learning techniques are suitable to compute these estimates accurately and efficiently. This study promotes a feature selection algorithm, which selects the most informative magnitudes and colours for a given task of estimating physical quantities from photometric data alone. Using k nearest neighbours regression, a well-known non-parametric machine learning method, we show that using the found features significantly increases the accuracy of the estimations compared to using standard features and standard methods. We illustrate the usefulness of the approach by estimating specific star formation rates (sSFRs) and redshifts (photo-z’s) using only the broad-band photometry from the Sloan Digital Sky Survey (SDSS). For estimating sSFRs, we demonstrate that our method produces better estimates than traditional spectral energy distribution (SED) fitting. For estimating photo-z’s, we show that our method produces more accurate photo-z’s than the method employed by SDSS. The study highlights the general importance of performing proper model selection to improve the results of machine learning systems and how feature selection can provide insights into the predictive relevance of particular input features.
@article{StensboSmidt2016, author = {Stensbo-Smidt, Kristoffer and Gieseke, Fabian and Zirm, Andrew and Pedersen, Kim Steenstrup and Igel, Christian}, title = {Sacrificing information for the greater good: how to select photometric bands for optimal accuracy}, journal = {Monthly Notices of the Royal Astronomical Society {(MNRAS)}}, year = {2017}, volume = {464}, number = {3}, pages = {2577-2596}, publisher = {Oxford University Press}, tags = {application}, }
Exploring the spectroscopic diversity of type Ia supernovae with DRACULA: A machine learning approach

M. Sasdelli, E. O. Ishida, R. Vilalta, M. Aguena, V. C. Busti, H. Camacho, A. M. M. Trindade, F. Gieseke, R. S. Souza, Y. T. Fantaye, and P. A. Mazzali

Monthly Notices of the Royal Astronomical Society (MNRAS) 2016

The existence of multiple subclasses of Type Ia supernovae (SNe Ia) has been the subject of great debate in the last decade. One major challenge inevitably met when trying to infer the existence of one or more subclasses is the time consuming, and subjective, process of subclass definition. In this work, we show how machine learning tools facilitate identification of subtypes of SNe Ia through the establishment of a hierarchical group structure in the continuous space of spectral diversity formed by these objects. Using deep learning, we were capable of performing such identification in a four-dimensional feature space (+1 for time evolution), while the standard principal component analysis barely achieves similar results using 15 principal components. This is evidence that the progenitor system and the explosion mechanism can be described by a small number of initial physical parameters. As a proof of concept, we show that our results are in close agreement with a previously suggested classification scheme and that our proposed method can grasp the main spectral features behind the definition of such subtypes. This allows the confirmation of the velocity of lines as a first-order effect in the determination of SN Ia subtypes, followed by 91bg-like events. Given the expected data deluge in the forthcoming years, our proposed approach is essential to allow a quick and statistically coherent identification of SNe Ia subtypes (and outliers). All tools used in this work were made publicly available in the Python package Dimensionality Reduction And Clustering for Unsupervised Learning in Astronomy (dracula) and can be found within COINtoolbox (https://github.com/COINtoolbox/DRACULA).
@article{Sasdelli2016, author = {Sasdelli, Michele and Ishida, E. O. and Vilalta, R. and Aguena, M. and Busti, V. C. and Camacho, H. and Trindade, A. M. M. and Gieseke, Fabian and de Souza, R. S. and Fantaye, Y. T. and Mazzali, P. A.}, title = {Exploring the spectroscopic diversity of type Ia supernovae with {DRACULA}: {A} machine learning approach}, journal = {Monthly Notices of the Royal Astronomical Society (MNRAS)}, year = {2016}, volume = {461}, number = {2}, pages = {2044-2059}, publisher = {Oxford University Press}, tags = {application}, }
Exploring the spectroscopic diversity of type Ia supernovae with Deep Learning and Unsupervised Clustering

E. E. O. Ishida, M. Sasdelli, R. Vilalta, M. Aguena, V. C. Busti, H. Camacho, A. M. M. Trindade, F. Gieseke, R. S. Souza, Y. T. Fantaye, and P. A. Mazzali

Astroinformatics 2016, Sorrento, Italy, October 19-25, 2016 2016

The existence of multiple subclasses of type Ia supernovae (SNe Ia) has been the subject of great debate in the last decade. In this work, we show how machine learning tools facilitate identification of subtypes of SNe Ia. Using Deep Learning for dimensionality reduction, we were capable of performing such identification in a parameter space of significantly lower dimension than its principal component analysis counterpart. This is evidence that the progenitor system and the explosion mechanism can be described with a small number of initial physical parameters. All tools used here are publicly available in the Python package DRACULA (Dimensionality Reduction And Clustering for Unsupervised Learning in Astronomy) and can be found within COINtoolbox (https://github.com/COINtoolbox/DRACULA).
@inproceedings{IshidaSVABCTGSF16, author = {Ishida, Emille E. O. and Sasdelli, Michele and Vilalta, Ricardo and Aguena, Michel and Busti, Vinicius C. and Camacho, Hugo and Trindade, Arlindo M. M. and Gieseke, Fabian and de Souza, Rafael S. and Fantaye, Yabebal T. and Mazzali, Paolo A.}, editor = {Brescia, Massimo and Djorgovski, S. George and Feigelson, Eric D. and Longo, Giuseppe and Cavuoti, Stefano}, title = {Exploring the spectroscopic diversity of type Ia supernovae with Deep Learning and Unsupervised Clustering}, booktitle = {Astroinformatics 2016, Sorrento, Italy, October 19-25, 2016}, series = {Proceedings of the International Astronomical Union}, volume = {12}, number = {{S325}}, pages = {247--252}, publisher = {Cambridge University Press}, year = {2016}, tags = {application}, }

Parallelized rotation and flipping INvariant Kohonen maps (PINK) on GPUs

K. L. Polsterer, F. Gieseke, C. Igel, B. Doser, and N. Gianniotis

24th European Symposium on Artificial Neural Networks, ESANN 2016, Bruges, Belgium, April 27-29, 2016 2016

@inproceedings{PolstererGIDG16,
  author = {Polsterer, Kai Lars and Gieseke, Fabian and Igel, Christian and Doser, Bernd and Gianniotis, Nikolaos},
  title = {Parallelized rotation and flipping INvariant Kohonen maps {(PINK)} on GPUs},
  booktitle = {24th European Symposium on Artificial Neural Networks, {ESANN} 2016, Bruges, Belgium, April 27-29, 2016},
  year = {2016},
  tags = {de,application,hpc},
}

Nearest Neighbor Density Ratio Estimation for Large-Scale Applications in Astronomy

J. Kremer, F. Gieseke, K. S. Pedersen, and C. Igel

Astronomy and Computing 2015

In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art.
@article{KremerGPI2015, author = {Kremer, Jan and Gieseke, Fabian and Pedersen, Kim Steenstrup and Igel, Christian}, title = {Nearest Neighbor Density Ratio Estimation for Large-Scale Applications in Astronomy}, journal = {Astronomy and Computing}, volume = {12}, pages = {62--72}, year = {2015}, tags = {application} }
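The nearest neighbor density ratio idea from the abstract above can be sketched as follows. This is a minimal brute-force illustration under stated assumptions, not the estimator exactly as published: the function name `nn_density_ratio`, the choice of radius (distance to the k-th nearest other training point), and the normalization are illustrative, and the cross-validated choice of the neighborhood size mentioned in the abstract is omitted.

```python
import numpy as np


def nn_density_ratio(X_train, X_test, k=5):
    """Estimate importance weights p_test(x) / p_train(x) at the training points.

    For every training point, the radius of the ball reaching its k-th nearest
    other training point is computed, and the test points inside that ball are
    counted. The ratio of the two counts, rescaled by the sample sizes, serves
    as the weight. Requires k < len(X_train).
    """
    n_tr, n_te = len(X_train), len(X_test)
    # Pairwise Euclidean distances (brute force, quadratic memory).
    d_tr = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    d_te = np.linalg.norm(X_train[:, None, :] - X_test[None, :, :], axis=2)
    # After sorting, column 0 is the point itself (distance 0), so column k
    # holds the distance to the k-th nearest other training point.
    radius = np.sort(d_tr, axis=1)[:, k]
    counts = np.sum(d_te <= radius[:, None], axis=1)
    return (n_tr / n_te) * counts / k
```

For large samples, the pairwise distance matrices would typically be replaced by a spatial index such as a k-d tree.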
Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering

F. Gieseke, T. Pahikkala, and T. Heskes

Advances in Intelligent Data Analysis XIV - 14th International Symposium, IDA 2015, Saint Etienne, France, October 22-24, 2015, Proceedings 2015

The maximum margin clustering principle extends support vector machines to unsupervised scenarios. We present a variant of this clustering scheme that can be used in the context of interactive clustering scenarios. In particular, our approach permits the class ratios to be manually defined by the user during the fitting process. Our framework can be used at early stages of the data mining process when no or very little information is given about the true clusters and class ratios. One of the key contributions is an adapted steepest-descent-mildest-ascent optimization scheme that can be used to fine-tune maximum margin clustering solutions in an interactive manner. We demonstrate the applicability of our approach in the context of remote sensing and astronomy with training sets consisting of hundreds of thousands of patterns.
@inproceedings{GiesekePH15, author = {Gieseke, Fabian and Pahikkala, Tapio and Heskes, Tom}, editor = {Fromont, {\'{E}}lisa and De Bie, Tijl and van Leeuwen, Matthijs}, title = {Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering}, booktitle = {Advances in Intelligent Data Analysis {XIV} - 14th International Symposium, {IDA} 2015, Saint Etienne, France, October 22-24, 2015, Proceedings}, series = {Lecture Notes in Computer Science}, volume = {9385}, pages = {95--107}, publisher = {Springer}, year = {2015}, doi = {10.1007/978-3-319-24465-5\_9}, tags = {de} }
An Efficient Many-Core Implementation for Semi-Supervised Support Vector Machines

F. Gieseke

Machine Learning, Optimization, and Big Data - First International Workshop, MOD 2015, Taormina, Sicily, Italy, July 21-23, 2015, Revised Selected Papers 2015

The concept of semi-supervised support vector machines extends classical support vector machines to learning scenarios, where both labeled and unlabeled patterns are given. In recent years, such semi-supervised extensions have gained considerable attention due to their huge potential for real-world applications with only small amounts of labeled data. While being appealing from a practical point of view, semi-supervised support vector machines lead to a combinatorial optimization problem that is difficult to address. Many optimization approaches have been proposed that aim at tackling this task. However, the computational requirements can still be very high, especially in case large data sets are considered and many model parameters need to be tuned. A recent trend in the field of big data analytics is to make use of graphics processing units to speed up computationally intensive tasks. In this work, such a massively-parallel implementation is developed for semi-supervised support vector machines. The experimental evaluation, conducted on commodity hardware, shows that valuable speed-ups of up to two orders of magnitude can be achieved over a standard single-core CPU execution.
@inproceedings{Gieseke15, author = {Gieseke, Fabian}, editor = {Pardalos, Panos M. and Pavone, Mario and Farinella, Giovanni Maria and Cutello, Vincenzo}, title = {An Efficient Many-Core Implementation for Semi-Supervised Support Vector Machines}, booktitle = {Machine Learning, Optimization, and Big Data - First International Workshop, {MOD} 2015, Taormina, Sicily, Italy, July 21-23, 2015, Revised Selected Papers}, series = {Lecture Notes in Computer Science}, volume = {9432}, pages = {145--157}, publisher = {Springer}, year = {2015}, doi = {10.1007/978-3-319-27926-8\_13}, tags = {hpc}, }
Fast and simple gradient-based optimization for semi-supervised support vector machines

F. Gieseke, A. Airola, T. Pahikkala, and O. Kramer

Neurocomputing 2014

One of the main learning tasks in machine learning is the one of classifying data items. The basis for such a task is usually a training set consisting of labeled patterns. In real-world settings, however, such labeled data are usually scarce, and the corresponding models might yield unsatisfying results. Unlabeled data, on the other hand, can often be obtained in huge quantities without much additional effort. A prominent research direction in the field of machine learning is semi-supervised support vector machines. This type of binary classification approach aims at taking the additional information provided by the unlabeled patterns into account to reveal more information about the structure of the data at hand. In some cases, this can yield significantly better classification results compared to a straightforward application of supervised models. One drawback, however, is the fact that generating such models requires solving difficult non-convex optimization tasks. In this work, we present a simple but effective gradient-based optimization framework to address the induced problems. The resulting method can be implemented easily using black-box optimization engines and yields excellent classification and runtime results on both sparse and non-sparse data sets.
@article{GiesekeAPK14, author = {Gieseke, Fabian and Airola, Antti and Pahikkala, Tapio and Kramer, Oliver}, title = {Fast and simple gradient-based optimization for semi-supervised support vector machines}, journal = {Neurocomputing}, volume = {123}, pages = {23--32}, year = {2014}, doi = {10.1016/j.neucom.2012.12.056}, tags = {hpc} }
On Unsupervised Training of Multi-Class Regularized Least-Squares Classifiers

T. Pahikkala, A. Airola, F. Gieseke, and O. Kramer

Journal of Computer Science and Technology (ICDM 2012 Special Issue) 2014

In this work, we present the first efficient algorithm for unsupervised training of multi-class regularized least-squares classifiers. The approach is closely related to the unsupervised extension of the support vector machine classifier known as maximum margin clustering, which recently has received considerable attention, though mostly considering the binary classification case. We present a combinatorial search scheme that combines steepest descent strategies with powerful meta-heuristics for avoiding bad local optima. The regularized least-squares based formulation of the problem allows us to use matrix algebraic optimization enabling constant time checks for the intermediate candidate solutions during the search. Our experimental evaluation indicates the potential of the novel method and demonstrates its superior clustering performance over a variety of competing methods on real-world datasets. Both time complexity analysis and experimental comparisons show that the method can scale well to practical-sized problems.
@article{PahikkalaAGK14, author = {Pahikkala, Tapio and Airola, Antti and Gieseke, Fabian and Kramer, Oliver}, title = {On Unsupervised Training of Multi-Class Regularized Least-Squares Classifiers}, journal = {Journal of Computer Science and Technology (ICDM 2012 Special Issue)}, volume = {29}, number = {1}, pages = {90--104}, year = {2014}, doi = {10.1007/s11390-014-1414-0}, tags = {ml,de}, }
Speedy greedy feature selection: Better redshift estimation via massive parallelism

F. Gieseke, K. L. Polsterer, C. E. Oancea, and C. Igel

22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014 2014

Nearest neighbor models are among the most basic tools in machine learning, and recent work has demonstrated their effectiveness in the field of astronomy. The performance of these models crucially depends on the underlying metric, and in particular on the selection of a meaningful subset of informative features. The feature selection is task-dependent and usually very time-consuming. In this work, we propose an efficient parallel implementation of incremental feature selection for nearest neighbor models utilizing today's graphics processing units. Our framework provides significant computational speed-ups over its sequential single-core competitor of up to two orders of magnitude. We demonstrate the applicability of the overall scheme on one of the most challenging tasks in astronomy: redshift estimation for distant galaxies.
@inproceedings{GiesekePOI14, author = {Gieseke, Fabian and Polsterer, Kai Lars and Oancea, Cosmin Eugen and Igel, Christian}, title = {Speedy greedy feature selection: Better redshift estimation via massive parallelism}, booktitle = {22nd European Symposium on Artificial Neural Networks, {ESANN} 2014, Bruges, Belgium, April 23-25, 2014}, year = {2014}, tags = {ml}, }
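Incremental (greedy forward) feature selection for nearest neighbor regression, as described in the abstract above, can be sketched sequentially as follows. This is not the authors' GPU implementation; the function names, the hold-out validation split, and the RMSE criterion are assumptions chosen for illustration.

```python
import numpy as np


def knn_rmse(X_tr, y_tr, X_va, y_va, k):
    """Validation RMSE of brute-force k-nearest-neighbor regression."""
    d = np.linalg.norm(X_va[:, None, :] - X_tr[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest training points
    pred = y_tr[nn].mean(axis=1)           # average their targets
    return float(np.sqrt(np.mean((pred - y_va) ** 2)))


def greedy_forward_selection(X_tr, y_tr, X_va, y_va, n_select, k=5):
    """Add, one at a time, the feature that most reduces the validation error."""
    selected = []
    remaining = list(range(X_tr.shape[1]))
    for _ in range(n_select):
        errors = {}
        for j in remaining:
            cols = selected + [j]
            errors[j] = knn_rmse(X_tr[:, cols], y_tr, X_va[:, cols], y_va, k)
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Every candidate feature in a round requires a full set of nearest neighbor queries, which is why the procedure is time-consuming when run sequentially.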
Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs

F. Gieseke, J. Heinermann, C. E. Oancea, and C. Igel

ICML14 Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 2014

We present a new approach for combining k-d trees and graphics processing units for nearest neighbor search. It is well known that a direct combination of these tools leads to a non-satisfying performance due to conditional computations and suboptimal memory accesses. To alleviate these problems, we propose a variant of the classical k-d tree data structure, called buffer k-d tree, which can be used to reorganize the search. Our experiments show that we can take advantage of both the hierarchical subdivision induced by k-d trees and the huge computational resources provided by today’s many-core devices. We demonstrate the potential of our approach in astronomy, where hundreds of millions of nearest neighbor queries have to be processed.
@inproceedings{GiesekeHOI14, author = {Gieseke, Fabian and Heinermann, Justin and Oancea, Cosmin E. and Igel, Christian}, title = {Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs}, booktitle = {Proceedings of the 31st International Conference on Machine Learning, {ICML} 2014, Beijing, China, 21-26 June 2014}, series = {{JMLR} Workshop and Conference Proceedings}, volume = {32}, pages = {172--180}, publisher = {JMLR.org}, year = {2014}, tags = {ml}, }
A Framework for Data Mining in Wind Power Time Series

O. Kramer, F. Gieseke, J. Heinermann, J. Poloczek, and N. A. Treiber

Data Analytics for Renewable Energy Integration - Second ECML PKDD Workshop, DARE 2014, Nancy, France, September 19, 2014, Revised Selected Papers 2014

Wind energy is playing an increasingly important part in an ecologically friendly power supply. The fast-growing infrastructure of wind turbines can be seen as a large sensor system that screens the wind energy at a high temporal and spatial resolution. The resulting databases consist of huge amounts of wind energy time series data that can be used for prediction, controlling, and planning purposes. In this work, we describe WindML, a Python-based framework for wind energy related machine learning approaches. The main objective of WindML is the continuous development of tools that address important challenges induced by the growing wind energy information infrastructures. Various examples that demonstrate typical use cases are introduced and related research questions are discussed. The different modules of WindML range from standard machine learning algorithms to advanced techniques for handling missing data and monitoring high-dimensional time series.
@inproceedings{KramerGHPT14, author = {Kramer, Oliver and Gieseke, Fabian and Heinermann, Justin and Poloczek, Jendrik and Treiber, Nils Andr{\'{e}}}, editor = {Woon, Wei Lee and Aung, Zeyar and Madnick, Stuart E.}, title = {A Framework for Data Mining in Wind Power Time Series}, booktitle = {Data Analytics for Renewable Energy Integration - Second {ECML} {PKDD} Workshop, {DARE} 2014, Nancy, France, September 19, 2014, Revised Selected Papers}, series = {Lecture Notes in Computer Science}, volume = {8817}, pages = {97--107}, publisher = {Springer}, year = {2014}, doi = {10.1007/978-3-319-13290-7\_8}, tags = {application,energy}, }
Finding New High-Redshift Quasars by Asking the Neighbours

K. L. Polsterer, P. Zinn, and F. Gieseke

Monthly Notices of the Royal Astronomical Society (MNRAS) 2013

Quasars with a high redshift (z) are important to understand the evolution processes of galaxies in the early Universe. However, only a few of these distant objects are known to this date. The costs of building and operating a 10-m class telescope limit the number of facilities and, thus, the available observation time. Therefore, an efficient selection of candidates is mandatory. This paper presents a new approach to select quasar candidates with high redshift (z > 4.8) based on photometric catalogues. We have chosen to use the z > 4.8 limit for our approach because the dominant Lyman α emission line of a quasar can only be found in the Sloan i- and z-band filters. As part of the candidate selection approach, a photometric redshift estimator is presented, too. Three of the 120 000 generated candidates have been spectroscopically analysed in follow-up observations and a new z = 5.0 quasar was found. This result is consistent with the estimated detection ratio of about 50 per cent and we expect 60 000 high-redshift quasars to be part of our candidate sample. The created candidates are available for download at MNRAS or at http://www.astro.rub.de/polsterer/quasar-candidates.csv.
@article{PolstererZG2013, author = {Polsterer, Kai Lars and Zinn, Peter and Gieseke, Fabian}, title = {Finding New High-Redshift Quasars by Asking the Neighbours}, journal = {Monthly Notices of the Royal Astronomical Society (MNRAS)}, year = {2013}, volume = {428}, number = {1}, pages = {226-235}, publisher = {Oxford University Press}, tags = {application}, }
Learning morphological maps of galaxies with unsupervised regression

O. Kramer, F. Gieseke, and K. L. Polsterer

Expert Systems with Applications 2013

Hubble’s morphological classification of galaxies has been broadly accepted in astronomy for decades. Numerous extensions have been proposed in the past, mostly based on galaxy prototypes. In this work, we automatically learn morphological maps of galaxies with unsupervised machine learning methods that preserve neighborhood relations and data space distances. To this end, we focus on a stochastic variant of unsupervised nearest neighbors (UNN) for arranging galaxy prototypes on a two-dimensional map. UNN regression is the unsupervised counterpart of nearest neighbor regression for dimensionality reduction. In the experimental part of this article, we visualize the embeddings and compare the learning results achieved by various UNN parameterizations and related dimensionality reduction methods.
@article{KramerGP13, author = {Kramer, Oliver and Gieseke, Fabian and Polsterer, Kai Lars}, title = {Learning morphological maps of galaxies with unsupervised regression}, journal = {Expert Systems with Applications}, volume = {40}, number = {8}, pages = {2841--2844}, year = {2013}, doi = {10.1016/j.eswa.2012.12.002}, tags = {application}, }
Wind energy prediction and monitoring with neural computation

O. Kramer, F. Gieseke, and B. Satzger

Neurocomputing 2013

Wind energy has an important part to play as a renewable energy resource in a sustainable world. For a reliable integration of wind energy, high-dimensional wind time-series have to be analyzed. Fault analysis and prediction are important aspects in this context. The objective of this work is to show how methods from neural computation can serve as forecasting and monitoring techniques, contributing to a successful integration of wind into sustainable and smart energy grids. We will employ support vector regression as a prediction method for wind energy time-series. Furthermore, we will use dimension reduction techniques like self-organizing maps for monitoring of high-dimensional wind time-series. The methods are briefly introduced, related work is presented, and experimental case studies are described by way of example. The experimental parts are based on real wind energy time-series data from the National Renewable Energy Laboratory (NREL) western wind resource data set.
@article{KramerGS13, author = {Kramer, Oliver and Gieseke, Fabian and Satzger, Benjamin}, title = {Wind energy prediction and monitoring with neural computation}, journal = {Neurocomputing}, volume = {109}, pages = {84--93}, year = {2013}, doi = {10.1016/j.neucom.2012.07.029}, tags = {energy,application}, }
From Supervised to Unsupervised Support Vector Machines and Applications in Astronomy

F. Gieseke

KI 2013

Abs Bib HTML

Support vector machines are among the most popular techniques in machine learning. Given sufficient labeled data, they often yield excellent results. However, for a variety of real-world tasks, the acquisition of sufficient labeled data can be very time-consuming; unlabeled data, on the other hand, can often be obtained easily in huge quantities. Semi-supervised support vector machines try to take advantage of these additional unlabeled patterns and have been successfully applied in this context. However, they induce a hard combinatorial optimization problem. In this work, we present two optimization strategies that address this task and evaluate the potential of the resulting implementations on real-world data sets, including an example from the field of astronomy.
@article{Gieseke13, author = {Gieseke, Fabian}, title = {From Supervised to Unsupervised Support Vector Machines and Applications in Astronomy}, journal = {{KI}}, volume = {27}, number = {3}, pages = {281--285}, year = {2013}, doi = {10.1007/s13218-013-0248-1}, tags = {ml,de,application}, }
Polynomial Runtime Bounds for Fixed-Rank Unsupervised Least-Squares Classification

F. Gieseke, T. Pahikkala, and C. Igel

ACML13 Asian Conference on Machine Learning, ACML 2013, Canberra, ACT, Australia, November 13-15, 2013 2013

Abs Bib HTML PDF

Maximum margin clustering can be regarded as the direct extension of support vector machines to unsupervised learning scenarios. The goal is to partition unlabeled data into two classes such that a subsequent application of a support vector machine would yield the overall best result (with respect to the optimization problem associated with support vector machines). While being very appealing from a conceptual point of view, the combinatorial nature of the induced optimization problem renders a direct application of this concept difficult. In order to obtain efficient optimization schemes, various surrogates of the original problem definition have been proposed in the literature. In this work, we consider one of these variants, called unsupervised regularized least-squares classification, which is based on the square loss, and develop polynomial upper runtime bounds for the induced combinatorial optimization task. In particular, we show that for n patterns and kernel matrix of fixed rank r (with given eigendecomposition), one can obtain an optimal solution in O(n^r) time for r ≤ 2 and in O(n^(r−1)) time for r ≥ 3. The algorithmic framework is based on an interesting connection to the field of quadratic zero-one programming and permits the computation of exact solutions for the more general case of non-linear kernel functions in polynomial time.
@inproceedings{GiesekePI13, author = {Gieseke, Fabian and Pahikkala, Tapio and Igel, Christian}, editor = {Ong, Cheng Soon and Ho, Tu Bao}, title = {Polynomial Runtime Bounds for Fixed-Rank Unsupervised Least-Squares Classification}, booktitle = {Asian Conference on Machine Learning, {ACML} 2013, Canberra, ACT, Australia, November 13-15, 2013}, series = {{JMLR} Workshop and Conference Proceedings}, volume = {29}, pages = {62--71}, publisher = {JMLR.org}, year = {2013}, tags = {ml}, }
Towards Non-linear Constraint Estimation for Expensive Optimization

F. Gieseke, and O. Kramer

Applications of Evolutionary Computation - 16th European Conference, EvoApplications 2013, Vienna, Austria, April 3-5, 2013. Proceedings 2013

Abs Bib HTML

Constraints can render a numerical optimization problem much more difficult to address. In many real-world optimization applications, however, such constraints are not explicitly given. Instead, one has access to some kind of a “black-box” that represents the (unknown) constraint function. Recently, we proposed a fast linear constraint estimator that was based on binary search. This paper extends these results by (a) providing an alternative scheme that resorts to the effective use of support vector machines and by (b) addressing the more general task of non-linear decision boundaries. In particular, we make use of active learning strategies from the field of machine learning to select reasonable training points for the recurrent application of the classifier. We compare both constraint estimation schemes on linear and non-linear constraint functions, and depict opportunities and pitfalls concerning the effective integration of such models into a global optimization process.
@inproceedings{GiesekeK13, author = {Gieseke, Fabian and Kramer, Oliver}, editor = {Esparcia{-}Alc{\'{a}}zar, Anna Isabel}, title = {Towards Non-linear Constraint Estimation for Expensive Optimization}, booktitle = {Applications of Evolutionary Computation - 16th European Conference, EvoApplications 2013, Vienna, Austria, April 3-5, 2013. Proceedings}, series = {Lecture Notes in Computer Science}, volume = {7835}, pages = {459--468}, publisher = {Springer}, year = {2013}, tags = {ml}, }
On GPU-Based Nearest Neighbor Queries for Large-Scale Photometric Catalogs in Astronomy

J. Heinermann, O. Kramer, K. L. Polsterer, and F. Gieseke

KI 2013: Advances in Artificial Intelligence - 36th Annual German Conference on AI, Koblenz, Germany, September 16-20, 2013. Proceedings 2013

Abs Bib HTML

Nowadays astronomical catalogs contain patterns of hundreds of millions of objects with data volumes in the terabyte range. Upcoming projects will gather such patterns for several billions of objects with peta- and exabytes of data. From a machine learning point of view, these settings often yield unsupervised, semi-supervised, or fully supervised tasks, with large training and huge test sets. Recent studies have demonstrated the effectiveness of prototype-based learning schemes such as simple nearest neighbor models. However, although being among the most computationally efficient methods for such settings (if implemented via spatial data structures), applying these models on all remaining patterns in a given catalog can easily take hours or even days. In this work, we investigate the practical effectiveness of GPU-based approaches to accelerate such nearest neighbor queries in this context. Our experiments indicate that carefully tuned implementations of spatial search structures for such multi-core devices can significantly reduce the practical runtime. This renders the resulting frameworks an important algorithmic tool for current and upcoming data analyses in astronomy.
@inproceedings{HeinermannKPG13, author = {Heinermann, Justin and Kramer, Oliver and Polsterer, Kai Lars and Gieseke, Fabian}, editor = {Timm, Ingo J. and Thimm, Matthias}, title = {On GPU-Based Nearest Neighbor Queries for Large-Scale Photometric Catalogs in Astronomy}, booktitle = {{KI} 2013: Advances in Artificial Intelligence - 36th Annual German Conference on AI, Koblenz, Germany, September 16-20, 2013. Proceedings}, series = {Lecture Notes in Computer Science}, volume = {8077}, pages = {86--97}, publisher = {Springer}, year = {2013}, doi = {10.1007/978-3-642-40942-4\_8}, tags = {hpc}, }
Evolutionary kernel density regression

O. Kramer, and F. Gieseke

Expert Systems with Applications 2012

Abs Bib HTML

The Nadaraya–Watson estimator, also known as kernel regression, is a density-based regression technique. It weights output values with the relative densities in input space. The density is measured with kernel functions that depend on bandwidth parameters. In this work we present an evolutionary bandwidth optimizer for kernel regression. The approach is based on a robust loss function, leave-one-out cross-validation, and the CMSA-ES as optimization engine. A variant with local parameterized Nadaraya–Watson models enhances the approach, and allows the adaptation of the model to local data space characteristics. The unsupervised counterpart of kernel regression is an approach to learn principal manifolds. The learning problem of unsupervised kernel regression (UKR) is based on optimizing the latent variables, which is a multimodal problem with many local optima. We propose an evolutionary framework for optimization of UKR based on scaling of initial local linear embedding solutions, and minimization of the cross-validation error. Both methods are analyzed experimentally.
@article{KramerG12, author = {Kramer, Oliver and Gieseke, Fabian}, title = {Evolutionary kernel density regression}, journal = {Expert Systems with Applications}, volume = {39}, number = {10}, pages = {9246--9254}, year = {2012}, doi = {10.1016/j.eswa.2012.02.080}, tags = {de}, }
Efficient recurrent local search strategies for semi- and unsupervised regularized least-squares classification

F. Gieseke, O. Kramer, A. Airola, and T. Pahikkala

Evolutionary Intelligence 2012

Abs Bib HTML

Binary classification tasks are among the most important ones in the field of machine learning. One prominent approach to address such tasks is the support vector machine, which aims at finding a hyperplane separating two classes well such that the induced distance between the hyperplane and the patterns is maximized. In general, sufficient labeled data is needed for such classification settings to obtain reasonable models. However, labeled data is often rare in real-world learning scenarios while unlabeled data can be obtained easily. For this reason, the concept of support vector machines has also been extended to semi- and unsupervised settings: in the unsupervised case, one aims at finding a partition of the data into two classes such that a subsequent application of a support vector machine leads to the best overall result. Similarly, given both a labeled and an unlabeled part, semi-supervised support vector machines favor decision hyperplanes that lie in a low density area induced by the unlabeled training patterns, while still considering the labeled part of the data. The associated optimization problems for both the semi- and unsupervised case, however, are of combinatorial nature and, hence, difficult to solve. In this work, we present efficient implementations of simple local search strategies for (variants of) both cases that are based on matrix update schemes for the intermediate candidate solutions. We evaluate the performances of the resulting approaches on a variety of artificial and real-world data sets. The results indicate that our approaches can successfully incorporate unlabeled data. (The unsupervised case was originally proposed by Gieseke F, Pahikkala et al. (2009). The derivations presented in this work are new and comprehend the old ones (for the unsupervised setting) as a special case.)
@article{GiesekeKAP12, author = {Gieseke, Fabian and Kramer, Oliver and Airola, Antti and Pahikkala, Tapio}, title = {Efficient recurrent local search strategies for semi- and unsupervised regularized least-squares classification}, journal = {Evolutionary Intelligence}, volume = {5}, number = {3}, pages = {189--205}, year = {2012}, doi = {10.1007/s12065-012-0068-5}, tags = {ml} }
Resilient k-d trees: k-means in space revisited

F. Gieseke, G. Moruz, and J. Vahrenhold

Frontiers of Computer Science (ICDM 2010 Special Issue) 2012

Abs Bib HTML

We propose a k-d tree variant that is resilient to a pre-described number of memory corruptions while still using only linear space. While the data structure is of independent interest, we demonstrate its use in the context of high-radiation environments. Our experimental evaluation demonstrates that the resulting approach leads to a significantly higher resiliency rate compared to previous results. This is especially the case for large-scale multi-spectral satellite data, which renders the proposed approach well-suited to operate aboard today’s satellites.
@article{GiesekeMV12, author = {Gieseke, Fabian and Moruz, Gabriel and Vahrenhold, Jan}, title = {Resilient k-d trees: k-means in space revisited}, journal = {Frontiers of Computer Science (ICDM 2010 Special Issue)}, volume = {6}, number = {2}, pages = {166--178}, year = {2012}, doi = {10.1007/s11704-012-2870-8}, tags = {de} }
Von überwachten zu unüberwachten Support-Vektor-Maschinen und Anwendungen in der Astronomie

F. Gieseke

Ausgezeichnete Informatikdissertationen 2012 2012

Abs Bib PDF

A well-known problem in machine learning is the classification of objects. The corresponding models are usually based on training data consisting of patterns with associated labels. For certain applications, however, creating a sufficiently large data set can prove very costly or time-consuming. A current research direction in machine learning aims at using (additional) unlabeled patterns, which can often be obtained with little effort. This contribution describes the extension of so-called support vector machines to such learning scenarios. In contrast to support vector machines, however, these variants lead to combinatorial optimization problems. The development of efficient optimization strategies is therefore a worthwhile goal and is discussed in this contribution. Furthermore, possible application areas of the corresponding methods are outlined, which can be found, among others, in the field of astronomy.
@inproceedings{Gieseke12, author = {Gieseke, Fabian}, editor = {H{\"{o}}lldobler, Steffen}, title = {Von {\"{u}}berwachten zu un{\"{u}}berwachten Support-Vektor-Maschinen und Anwendungen in der Astronomie}, booktitle = {Ausgezeichnete Informatikdissertationen 2012}, series = {{LNI}}, volume = {{D-13}}, pages = {111--120}, publisher = {{GI}}, year = {2012}, tags = {ml,de,application}, }
Unsupervised Multi-class Regularized Least-Squares Classification

T. Pahikkala, A. Airola, F. Gieseke, and O. Kramer

ICDM12 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012 2012

Abs Bib HTML PDF

Regularized least-squares classification is one of the most promising alternatives to standard support vector machines, with the desirable property of closed-form solutions that can be obtained analytically, and efficiently. While the supervised, and mostly binary case has received tremendous attention in recent years, unsupervised multi-class settings have not yet been considered. In this work we present an efficient implementation for the unsupervised extension of the multi-class regularized least-squares classification framework, which is, to the best of the authors’ knowledge, the first one in the literature addressing this task. The resulting kernel-based framework efficiently combines steepest descent strategies with powerful meta-heuristics for avoiding local minima. The computational efficiency of the overall approach is ensured through the application of matrix algebra shortcuts that render efficient updates of the intermediate candidate solutions possible. Our experimental evaluation indicates the potential of the novel method, and demonstrates its superior clustering performance over a variety of competing methods on real-world data sets.
@inproceedings{PahikkalaAGK12, author = {Pahikkala, Tapio and Airola, Antti and Gieseke, Fabian and Kramer, Oliver}, editor = {Zaki, Mohammed Javeed and Siebes, Arno and Yu, Jeffrey Xu and Goethals, Bart and Webb, Geoffrey I. and Wu, Xindong}, title = {Unsupervised Multi-class Regularized Least-Squares Classification}, booktitle = {12th {IEEE} International Conference on Data Mining, {ICDM} 2012, Brussels, Belgium, December 10-13, 2012}, pages = {585--594}, publisher = {{IEEE} Computer Society}, year = {2012}, doi = {10.1109/ICDM.2012.71}, tags = {ml}, }
Sparse Quasi-Newton Optimization for Semi-supervised Support Vector Machines

F. Gieseke, A. Airola, T. Pahikkala, and O. Kramer

ICPRAM 2012 - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, Volume 1, Vilamoura, Algarve, Portugal, 6-8 February, 2012 2012

Abs Bib HTML PDF

In real-world scenarios, labeled data is often rare while unlabeled data can be obtained in huge quantities. A current research direction in machine learning is the concept of semi-supervised support vector machines. This type of binary classification approach aims at taking the additional information provided by unlabeled patterns into account to reveal more information about the structure of the data and, hence, to yield models with a better classification performance. However, generating these semi-supervised models requires solving difficult optimization tasks. In this work, we present a simple but effective approach to address the induced optimization task, which is based on a special instance of the quasi-Newton family of optimization schemes. The resulting framework can be implemented easily using black box optimization engines and yields excellent classification and runtime results on both artificial and real-world data sets that are superior (or at least competitive) to the ones obtained by competing state-of-the-art methods.
@inproceedings{GiesekeAPK12, author = {Gieseke, Fabian and Airola, Antti and Pahikkala, Tapio and Kramer, Oliver}, editor = {Carmona, Pedro Latorre and S{\'{a}}nchez, J. Salvador and Fred, Ana L. N.}, title = {Sparse Quasi-Newton Optimization for Semi-supervised Support Vector Machines}, booktitle = {{ICPRAM} 2012 - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, Volume 1, Vilamoura, Algarve, Portugal, 6-8 February, 2012}, pages = {45--54}, publisher = {SciTePress}, year = {2012}, tags = {ml}, }
From supervised to unsupervised support vector machines and applications in astronomy

F. Gieseke

thesis Carl von Ossietzky University of Oldenburg 2011

Abs Bib HTML PDF

A common task in the field of machine learning is the classification of objects. The basis for such a task is usually a training set consisting of patterns and associated class labels. A typical example is, for instance, the automatic classification of stars and galaxies in the field of astronomy. Here, the training set could consist of images and associated labels, which indicate whether a particular image shows a star or a galaxy. For such a learning scenario, one aims at generating models that can automatically classify new, unseen images. In the field of machine learning, various classification schemes have been proposed. One of the most popular ones is the concept of support vector machines, which often yields excellent classification results given sufficient labeled data. However, for a variety of real-world tasks, the acquisition of sufficient labeled data can be quite time-consuming. In contrast to labeled training data, unlabeled data can often be obtained easily in huge quantities. Semi- and unsupervised techniques aim at taking these unlabeled patterns into account to generate appropriate models. In the literature, various ways of extending support vector machines to these scenarios have been proposed. One of these ways leads to combinatorial optimization tasks that are difficult to address. In this thesis, several optimization strategies will be developed for these tasks that (1) aim at solving them exactly or (2) aim at obtaining (possibly suboptimal) candidate solutions efficiently. More specifically, we will derive a polynomial-time approach that can compute exact solutions for special cases of both tasks. This approach is among the first ones that provide upper runtime bounds for the tasks at hand and, thus, yield theoretical insights into their computational complexity. In addition to this exact scheme, two heuristics tackling both problems will be provided.
The first one is based on least-squares variants of the original tasks whereas the second one relies on differentiable surrogates for the corresponding objective functions. While direct implementations of both heuristics are still computationally expensive, we will show how to make use of matrix operations to speed up their execution. This will result in two optimization schemes that exhibit an excellent classification and runtime performance. Besides these theoretical derivations, we will also depict possible application domains of machine learning methods in astronomy. Here, the massive amount of data given for today’s and future projects renders a manual analysis impossible and necessitates the use of sophisticated techniques. In this context, we will derive an efficient way to preprocess spectroscopic data, which is based on an adaptation of support vector machines, and the benefits of semi-supervised learning schemes for appropriate learning tasks will be sketched. As a further contribution to this field, we will propose the use of so-called resilient algorithms for the automatic data analysis taking place aboard today’s spacecraft and will demonstrate their benefits in the context of clustering hyperspectral image data.
@phdthesis{Gieseke2011, author = {Gieseke, Fabian}, title = {From supervised to unsupervised support vector machines and applications in astronomy}, school = {Carl von Ossietzky University of Oldenburg}, year = {2011}, tags = {ml,de,application} }
Analysis of wind energy time series with kernel methods and neural networks

O. Kramer, and F. Gieseke

Seventh International Conference on Natural Computation, ICNC 2011, Shanghai, China, 26-28 July, 2011 2011

Abs Bib HTML

Wind energy has an important part to play as renewable energy resource in a sustainable world. For a reliable integration of wind energy the volatile nature of wind has to be understood. This article shows how kernel methods and neural networks can serve as modeling, forecasting and monitoring techniques, and, how they contribute to a successful integration of wind into smart energy grids. First, we will employ kernel density estimation for modeling of wind data. Kernel density estimation allows a statistically sound modeling of time series data. The corresponding experiments are based on real data of wind energy time series from the NREL western wind resource dataset. Second, we will show how prediction of wind energy can be accomplished with the help of support vector regression. Last, we will use self-organizing feature maps to map high-dimensional wind time series to colored sequences that can be used for error detection.
@inproceedings{KramerG11, author = {Kramer, Oliver and Gieseke, Fabian}, editor = {Ding, Yongsheng and Wang, Haiying and Xiong, Ning and Hao, Kuangrong and Wang, Lipo}, title = {Analysis of wind energy time series with kernel methods and neural networks}, booktitle = {Seventh International Conference on Natural Computation, {ICNC} 2011, Shanghai, China, 26-28 July, 2011}, pages = {2381--2385}, publisher = {{IEEE}}, year = {2011}, doi = {10.1109/ICNC.2011.6022597}, tags = {application,energy} }

Speedy Local Search for Semi-Supervised Regularized Least-Squares

F. Gieseke, O. Kramer, A. Airola, and T. Pahikkala

KI 2011: Advances in Artificial Intelligence, 34th Annual German Conference on AI, Berlin, Germany, October 4-7,2011. Proceedings 2011

Bib HTML PDF

@inproceedings{GiesekeKAP11,
  author = {Gieseke, Fabian and Kramer, Oliver and Airola, Antti and Pahikkala, Tapio},
  editor = {Bach, Joscha and Edelkamp, Stefan},
  title = {Speedy Local Search for Semi-Supervised Regularized Least-Squares},
  booktitle = {{KI} 2011: Advances in Artificial Intelligence, 34th Annual German
                 Conference on AI, Berlin, Germany, October 4-7,2011. Proceedings},
  series = {Lecture Notes in Computer Science},
  volume = {7006},
  pages = {87--98},
  publisher = {Springer},
  year = {2011},
  doi = {10.1007/978-3-642-24455-1\_8},
  tags = {ml}
}

Variance Scaling for EDAs Revisited

O. Kramer, and F. Gieseke

KI 2011: Advances in Artificial Intelligence, 34th Annual German Conference on AI, Berlin, Germany, October 4-7,2011. Proceedings 2011

Abs Bib HTML

Estimation of distribution algorithms (EDAs) are derivative-free optimization approaches based on the successive estimation of the probability density function of the best solutions, and their subsequent sampling. It turns out that the success of EDAs in numerical optimization strongly depends on scaling of the variance. The contribution of this paper is a comparison of various adaptive and self-adaptive variance scaling techniques for a Gaussian EDA. The analysis includes: (1) the Gaussian EDA without scaling, but different selection pressures and population sizes, (2) the variance adaptation technique known as Silverman’s rule-of-thumb, (3) σ-self-adaptation known from evolution strategies, and (4) transformation of the solution space by estimation of the Hessian. We discuss the results for the sphere function, and its constrained counterpart.
@inproceedings{Kramer2011, author = {Kramer, Oliver and Gieseke, Fabian}, title = {Variance Scaling for EDAs Revisited}, booktitle = {{KI} 2011: Advances in Artificial Intelligence, 34th Annual German Conference on AI, Berlin, Germany, October 4-7,2011. Proceedings}, year = {2011}, editor = {Bach, Joscha and Edelkamp, Stefan}, volume = {7006}, series = {Lecture Notes in Computer Science}, pages = {169--178}, publisher = {Springer}, doi = {10.1007/978-3-642-24455-1\_16}, tags = {application}, }
Short-Term Wind Energy Forecasting Using Support Vector Regression

O. Kramer, and F. Gieseke

Soft Computing Models in Industrial and Environmental Applications, 6th International Conference SOCO 2011, 6-8 April, 2011, Salamanca, Spain 2011

Abs Bib HTML

Wind energy prediction has an important part to play in a smart energy grid for load balancing and capacity planning. In this paper we explore whether wind measurements based on the existing infrastructure of windmills in neighboring wind parks can be learned with a soft computing approach for wind energy prediction in the ten-minute to six-hour range. To this end, we employ Support Vector Regression (SVR) for time series forecasting, and run experimental analyses on real-world wind data from the NREL western wind resource dataset. In the experimental part of the paper we concentrate on loss function parameterization of SVR. We try to answer how far ahead a reliable wind forecast is possible, and how much information from the past is necessary. We demonstrate the capabilities of SVR-based wind energy forecast on the micro-scale level of one wind grid point, and on the larger scale of a whole wind park.
@inproceedings{Kramer2011a, author = {Kramer, Oliver and Gieseke, Fabian}, title = {Short-Term Wind Energy Forecasting Using Support Vector Regression}, booktitle = {Soft Computing Models in Industrial and Environmental Applications, 6th International Conference {SOCO} 2011, 6-8 April, 2011, Salamanca, Spain}, year = {2011}, editor = {Corchado, Emilio and Sn{\'{a}}sel, V{\'{a}}clav and Sedano, Javier and Hassanien, Aboul Ella and Calvo{-}Rolle, Jos{\'{e}} Lu{\'{\i}}s and Slezak, Dominik}, volume = {87}, series = {Advances in Intelligent and Soft Computing}, pages = {271--280}, publisher = {Springer}, doi = {10.1007/978-3-642-19644-7\_29}, tags = {application,energy}, }
Pruning spanners and constructing well-separated pair decompositions in the presence of memory hierarchies

F. Gieseke, J. Gudmundsson, and J. Vahrenhold

Journal of Discrete Algorithms (JDA) 2010

Abs Bib HTML

Given a geometric graph G = (S, E) in R^d with constant dilation t, and a positive constant ε, we show how to construct a (1 + ε)-spanner of G with O(|S|) edges using O(sort(|E|)) memory transfers in the cache-oblivious model of computation. The main building block of our algorithm, and of independent interest in itself, is a new cache-oblivious algorithm for constructing a well-separated pair decomposition which builds such a data structure for a given point set S ⊂ R^d using O(sort(|S|)) memory transfers.
@article{GiesekeGV10, author = {Gieseke, Fabian and Gudmundsson, Joachim and Vahrenhold, Jan}, title = {Pruning spanners and constructing well-separated pair decompositions in the presence of memory hierarchies}, journal = {Journal of Discrete Algorithms (JDA)}, volume = {8}, number = {3}, pages = {259--272}, year = {2010}, url = {https://doi.org/10.1016/j.jda.2010.03.001}, doi = {10.1016/j.jda.2010.03.001}, tags = {de}, }

Resilient K-d Trees: K-Means in Space Revisited

F. Gieseke, G. Moruz, and J. Vahrenhold

ICDM10 ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010 2010

Abs Bib HTML PDF

We develop a k-d tree variant that is resilient to a pre-described number of memory corruptions while still using only linear space. We show how to use this data structure in the context of clustering in high-radiation environments and demonstrate that our approach leads to a significantly higher resiliency rate compared to previous results.

@inproceedings{GiesekeMV10,
  author = {Gieseke, Fabian and Moruz, Gabriel and Vahrenhold, Jan},
  editor = {Webb, Geoffrey I. and Liu, Bing and Zhang, Chengqi and Gunopulos, Dimitrios and Wu, Xindong},
  title = {Resilient K-d Trees: K-Means in Space Revisited},
  booktitle = {{ICDM} 2010, The 10th {IEEE} International Conference on Data Mining,
                 Sydney, Australia, 14-17 December 2010},
  pages = {815--820},
  publisher = {{IEEE} Computer Society},
  year = {2010},
  doi = {10.1109/ICDM.2010.94},
  tags = {de},
}

Detecting Quasars in Large-Scale Astronomical Surveys

F. Gieseke, K. L. Polsterer, A. Thom, P. Zinn, D. Bomanns, R. Dettmar, O. Kramer, and J. Vahrenhold

The Ninth International Conference on Machine Learning and Applications, ICMLA 2010, Washington, DC, USA, 12-14 December 2010 2010

Abs Bib HTML PDF

We present a classification-based approach to identify quasi-stellar radio sources (quasars) in the Sloan Digital Sky Survey and evaluate its performance on a manually labeled training set. While reasonable results can already be obtained via approaches working only on photometric data, our experiments indicate that simple but problem-specific features extracted from spectroscopic data can significantly improve the classification performance. Since our approach works orthogonal to existing classification schemes used for building the spectroscopic catalogs, our classification results are well suited for a mutual assessment of the approaches’ accuracies.
@inproceedings{GiesekePTZBDKV10, author = {Gieseke, Fabian and Polsterer, Kai Lars and Thom, Andreas and Zinn, Peter and Bomanns, Dominik and Dettmar, Ralf{-}Jurgen and Kramer, Oliver and Vahrenhold, Jan}, editor = {Draghici, Sorin and Khoshgoftaar, Taghi M. and Palade, Vasile and Pedrycz, Witold and Wani, M. Arif and Zhu, Xingquan}, title = {Detecting Quasars in Large-Scale Astronomical Surveys}, booktitle = {The Ninth International Conference on Machine Learning and Applications, {ICMLA} 2010, Washington, DC, USA, 12-14 December 2010}, pages = {352--357}, publisher = {{IEEE} Computer Society}, year = {2010}, url = {https://doi.org/10.1109/ICMLA.2010.59}, doi = {10.1109/ICMLA.2010.59}, tags = {application}, }
Fast evolutionary maximum margin clustering

F. Gieseke, T. Pahikkala, and O. Kramer

ICML09 Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009 2009

Abs Bib HTML PDF

The maximum margin clustering approach is a recently proposed extension of the concept of support vector machines to the clustering problem. Briefly stated, it aims at finding an optimal partition of the data into two classes such that the margin induced by a subsequent application of a support vector machine is maximal. We propose a method based on stochastic search to address this hard optimization problem. While a direct implementation would be infeasible for large data sets, we present an efficient computational shortcut for assessing the “quality” of intermediate solutions. Experimental results show that our approach outperforms existing methods in terms of clustering accuracy.
@inproceedings{GiesekePK09, author = {Gieseke, Fabian and Pahikkala, Tapio and Kramer, Oliver}, editor = {Danyluk, Andrea Pohoreckyj and Bottou, L{\'{e}}on and Littman, Michael L.}, title = {Fast evolutionary maximum margin clustering}, booktitle = {Proceedings of the 26th Annual International Conference on Machine Learning, {ICML} 2009, Montreal, Quebec, Canada, June 14-18, 2009}, series = {{ACM} International Conference Proceeding Series}, volume = {382}, pages = {361--368}, publisher = {{ACM}}, year = {2009}, doi = {10.1145/1553374.1553421}, tags = {ml}, }
Cache-Oblivious Construction of a Well-Separated Pair Decomposition

F. Gieseke, and J. Vahrenhold

Proceedings of the 25th European Workshop on Computational Geometry 2009

Abs Bib HTML PDF

We present a cache-oblivious algorithm for computing a well-separated pair decomposition of a finite point set S ⊂ R^d using O(sort(|S|)) memory transfers.
@inproceedings{GiesekeV2009, author = {Gieseke, Fabian and Vahrenhold, Jan}, title = {Cache-Oblivious Construction of a Well-Separated Pair Decomposition}, booktitle = {Proceedings of the 25th European Workshop on Computational Geometry}, year = {2009}, pages = {341--344}, tags = {de}, }