Empowering Time Series AI: How Salesforce Is Leveraging Synthetic Data to Enhance Foundation Models

Time series analysis faces significant hurdles in data availability, quality, and diversity, all of which are critical to developing effective foundation models. Real-world datasets often fall short because of regulatory limitations, inherent biases, poor quality, and a shortage of paired text annotations, making it difficult to build robust, generalizable Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs). This scarcity affects tasks such as forecasting, classification, anomaly detection, reasoning, and captioning, and limits how much current AI advances can deliver.

Salesforce AI Research addresses these challenges by proposing a comprehensive approach to leveraging synthetic data to enhance TSFMs and TSLLMs. Their recent study, "Empowering Time Series Analysis with Synthetic Data", presents a strategy for using synthetic data to improve model training, evaluation, and fine-tuning, with a focus on mitigating bias, increasing dataset diversity, and enriching contextual information. By developing data generation frameworks and incorporating synthetic datasets, Salesforce AI aims to advance practical applications of TSFMs and TSLLMs, especially in sensitive areas such as healthcare and finance, where data sharing is strictly regulated.

The technical cornerstone of the Salesforce AI Research methodology is a set of synthetic data generation approaches, each addressing specific aspects of time series dynamics such as trends, seasonal patterns, and noise characteristics. For example, the ForecastPFN method combines linear-exponential trends and periodic seasonality with Weibull-distributed noise, simulating realistic yet diverse scenarios. Similarly, TimesFM integrates piecewise linear trends and autoregressive moving average (ARMA) models with periodic patterns. Another technique, Chronos' KernelSynth, uses Gaussian processes (GPs) combined with linear, periodic, and radial basis function (RBF) kernels to generate rich synthetic datasets. These methods enable controlled yet varied synthetic data creation, helping to capture a broad range of realistic time series behaviors.
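To make the first of these recipes concrete, the sketch below multiplies a linear-exponential trend, a sinusoidal seasonal component, and Weibull-distributed noise. It is a minimal illustration under assumed settings, not the actual ForecastPFN generator: the function name `synthetic_series` and every parameter and default value are hypothetical choices made for this example.

```python
import math
import random


def synthetic_series(n, lin_slope=0.01, exp_rate=1.001, period=7,
                     season_amp=0.2, weibull_shape=2.0, noise_scale=0.1,
                     seed=0):
    """Return n points of a trend x seasonality x noise series.

    All names and defaults are assumptions for illustration only.
    """
    rng = random.Random(seed)
    # The mean of a Weibull variable with scale 1 and shape k is Gamma(1 + 1/k).
    # Subtracting it centers the noise factor around 1.
    weibull_mean = math.gamma(1.0 + 1.0 / weibull_shape)
    series = []
    for t in range(n):
        # Linear growth multiplied by exponential growth.
        trend = (1.0 + lin_slope * t) * (exp_rate ** t)
        # A single sinusoidal seasonal pattern with the given period.
        season = 1.0 + season_amp * math.sin(2.0 * math.pi * t / period)
        # Weibull draw (scale=1.0, shape=weibull_shape), recentered and scaled.
        z = rng.weibullvariate(1.0, weibull_shape)
        noise = 1.0 + noise_scale * (z - weibull_mean)
        series.append(trend * season * noise)
    return series


if __name__ == "__main__":
    # Print the first two weeks of a daily series with weekly seasonality.
    for value in synthetic_series(14):
        print(round(value, 3))
```

Varying the slope, growth rate, period, and noise shape per series is what produces a diverse corpus. A fixed seed keeps each generated series reproducible.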

The Salesforce team's findings highlight substantial benefits from synthetic data at multiple stages of model development. In pretraining, synthetic datasets provide clear performance gains, as demonstrated by models such as ForecastPFN, Mamba4Cast, and TimesFM. For example, ForecastPFN, pretrained entirely on synthetic data, shows significant improvements in zero-shot forecasting, while Chronos found the best performance when mixing about 10% synthetic data with real-world datasets; beyond that ratio, additional synthetic data could degrade performance because of its less diverse representations. Synthetic data also plays a crucial role in evaluation, enabling researchers to assess model capabilities precisely, understand internal representations, and identify gaps in learned patterns. MOMENT, for instance, uses synthetically generated sine waves to evaluate its internal embeddings and the model's sensitivity to changes in time series characteristics, demonstrating its effectiveness in capturing subtle trends and frequencies.
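The arithmetic behind a roughly 10% synthetic share can be sketched as follows. This is a simplified static mix, not the Chronos training pipeline: the function name `mix_corpus` and its parameters are hypothetical. If synthetic series should make up a fraction f of the final corpus, then for n_real real series the number of synthetic series to add is f / (1 - f) * n_real.

```python
import random


def mix_corpus(real, synthetic, synthetic_fraction=0.1, seed=0):
    """Combine real and synthetic series so that synthetic items make up
    about `synthetic_fraction` of the result (illustrative helper only)."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_syn / (n_real + n_syn) = f for n_syn.
    n_syn = round(synthetic_fraction / (1.0 - synthetic_fraction) * len(real))
    # Never request more synthetic series than are available.
    n_syn = min(n_syn, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real = [("real", i) for i in range(900)]
    synthetic = [("syn", i) for i in range(500)]
    mixed = mix_corpus(real, synthetic)
    share = sum(1 for tag, _ in mixed if tag == "syn") / len(mixed)
    print(len(mixed), share)
```

Treating the fraction as a tunable parameter makes it straightforward to test whether a larger synthetic share helps or hurts on a given benchmark.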

The paper also addresses current limitations in synthetic data usage and identifies areas for future improvement. One critical gap is the absence of systematic methods for integrating synthetic datasets, which points to the need to strategically identify and fill missing real-world data patterns. Another limitation is the dominance of statistical generation methods, prompting a call to explore data-driven generative techniques, such as diffusion models, to enhance realism. The Salesforce researchers further highlight the untapped potential of using synthetic data during the fine-tuning phase to address specific domain gaps or model weaknesses more efficiently and adaptively.

In summary, Salesforce AI Research shows that synthetic data provides a powerful toolset for overcoming data-related challenges in time series analysis. By systematically integrating high-quality synthetic datasets into every stage of model development, TSFMs and TSLLMs can achieve better generalization, reduced bias, and improved performance across diverse analytical tasks. Despite open challenges such as ensuring realism and alignment, continued progress in synthetic data generation methods shows significant potential. As Salesforce suggests, future research should focus on improving data realism, systematically addressing data gaps, and leveraging iterative, human-in-the-loop synthetic data generation processes. These advances could greatly expand the applicability and reliability of time series models, laying a solid foundation for future AI innovation.

Check out the paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
