Machine Learning Review: Filling the Gaps (3)

[E] Why does an ML model’s performance degrade in production?

There are several reasons why a machine learning model's performance might degrade in production:

  1. Data drift: The distribution of the input data changes over time (e.g., customer behavior, market conditions), and the model no longer sees the same data it was trained on.
  2. Concept drift: The underlying relationship between inputs and outputs changes (e.g., new trends, evolving patterns).
  3. Training-serving skew: There may be differences between the data used for training and the data seen in production (e.g., preprocessing discrepancies, feature engineering differences).
  4. Model staleness: The model may become outdated as new data becomes available, and the patterns it learned during training no longer apply.
  5. Infrastructure issues: Bugs, misconfigurations, or different hardware environments in production can introduce subtle errors that weren't present during testing.
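A quick way to monitor for the first failure mode above, data drift, is to compare a feature's production distribution against its training distribution with a two-sample statistical test. The sketch below is a minimal illustration, assuming numeric feature arrays and a hypothetical 0.05 significance threshold; real systems typically track many features over sliding windows.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, prod_values, alpha=0.05):
    """Flag drift when a two-sample KS test rejects 'same distribution'."""
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {"ks_stat": statistic, "p_value": p_value, "drift": p_value < alpha}

# Hypothetical data: the production feature has shifted by +0.5.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod = rng.normal(loc=0.5, scale=1.0, size=5_000)
print(detect_feature_drift(train, prod))  # drift: True
```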

Problems with Deploying Large Machine Learning Models

[M] What problems might we run into when deploying large machine learning models?

  1. Latency: Large models can be slow to make predictions, which may not meet the real-time requirements of certain applications (e.g., recommendation systems, autonomous driving).
  2. Resource usage: Large models require significant computational resources (e.g., memory, processing power) and may be difficult to deploy on resource-constrained environments (e.g., mobile devices, edge devices).
  3. Scalability: Serving a large model to many users, or replicating it across many servers, is difficult to scale, particularly if the model needs frequent retraining or updates.
  4. Inference costs: High computational requirements translate into higher inference costs in production, especially in cloud environments.
  5. Model interpretability: Large models, such as deep neural networks, are often less interpretable, making it harder to debug issues or explain predictions.
  6. Serving infrastructure complexity: Deploying large models may require specialized serving infrastructure, like GPUs or distributed systems, increasing operational complexity.
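A common mitigation for the latency, resource, and cost issues above is to compress the model before deployment. As one example, the sketch below applies PyTorch dynamic quantization to a stand-in network; the architecture is hypothetical, and quantization is only one option alongside pruning, distillation, and compiled runtimes.

```python
import torch
import torch.nn as nn

# Stand-in model; any network with nn.Linear layers is eligible.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization: weights of the listed layer types are stored as int8;
# activations are quantized on the fly at inference time (CPU-oriented).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Inference works as before, with a smaller memory footprint.
x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```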

Markov Chain Monte Carlo (MCMC)

Common MCMC algorithms:

  • Metropolis-Hastings: A new state is proposed from the current state and is either accepted or rejected with a probability based on the ratio of the target densities at the proposed and current states (with a correction for any asymmetry in the proposal). A minimal implementation is sketched below.
  • Gibbs sampling: A special case of MCMC that updates one variable at a time by sampling from its conditional distribution, given the other variables.

MCMC is widely used in Bayesian statistics and machine learning for sampling from posterior distributions.
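As a concrete illustration, here is a minimal random-walk Metropolis-Hastings sampler for a 1-D target known only up to a normalizing constant; the Gaussian proposal and the standard-normal target are illustrative choices, not part of the original answer.

```python
import numpy as np

def metropolis_hastings(log_target, n_samples, x0=0.0, proposal_std=1.0, seed=0):
    """Random-walk Metropolis: propose x' ~ N(x, proposal_std^2) and
    accept with probability min(1, p(x') / p(x))."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x, log_p = x0, log_target(x0)
    for i in range(n_samples):
        x_new = x + rng.normal(0.0, proposal_std)
        log_p_new = log_target(x_new)
        # The proposal is symmetric, so the Hastings correction cancels.
        if np.log(rng.random()) < log_p_new - log_p:
            x, log_p = x_new, log_p_new
        samples[i] = x
    return samples

# Illustrative target: standard normal, defined up to a constant.
samples = metropolis_hastings(lambda x: -0.5 * x * x, n_samples=10_000)
print(samples.mean(), samples.std())  # ~0 and ~1 after burn-in
```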

Sampling from High-Dimensional Data

[M] If you need to sample from high-dimensional data, which sampling method would you choose?

When sampling from high-dimensional data, Markov Chain Monte Carlo (MCMC) methods, such as Hamiltonian Monte Carlo (HMC) or Gibbs sampling, are often preferred. These methods are well-suited for high-dimensional spaces because:

  1. MCMC algorithms can explore complex, multi-dimensional distributions efficiently, particularly when the variables have strong dependencies that make simpler sampling methods impractical.
  2. Hamiltonian Monte Carlo (HMC) uses gradient information to take larger, more informed steps in high-dimensional spaces, reducing the risk of getting stuck in regions of low probability.

In cases where truly independent samples are required, importance sampling or rejection sampling can also be considered, although both become much less effective as dimensionality increases (the sketch below illustrates this).
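The dimensionality problem can be made concrete by measuring the effective sample size (ESS) of importance weights: with even a mildly mismatched proposal, the ESS collapses as dimensions are added. The Gaussian target/proposal pair below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def importance_ess(dim, n=10_000, seed=0):
    """Kish ESS of importance weights for a N(0, I) target
    sampled with a slightly-too-wide N(0, 1.5^2 I) proposal."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.5, size=(n, dim))  # draws from the proposal
    log_w = (multivariate_normal.logpdf(x, mean=np.zeros(dim))
             - multivariate_normal.logpdf(x, mean=np.zeros(dim),
                                          cov=1.5**2 * np.eye(dim)))
    w = np.exp(log_w - log_w.max())  # stabilize before normalizing
    w /= w.sum()
    return 1.0 / np.sum(w**2)

for d in (1, 10, 50, 100):
    print(d, round(importance_ess(d)))
# The ESS shrinks rapidly with dimension, which is why MCMC/HMC is preferred.
```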

Sampling 100K Comments to Label

[M] Suppose you have 10 million comments from 10K users spanning 24 months. How would you sample 100K comments to label?

When sampling comments for labeling, the goal is to ensure that the 100K samples are representative of the overall data distribution. Here are some strategies:

  1. Random Sampling: Select 100K comments at random from the pool of 10 million. This ensures every comment has an equal chance of being selected and provides a broad, unbiased sample.

  2. Stratified Sampling: If you suspect that certain user groups or time periods may have different comment behaviors (e.g., some users might post more abusive comments than others, or behavior changes over time), you can stratify the data by user or time and then randomly sample within each stratum to ensure that your sample is representative across these dimensions.

  3. Temporal Sampling: Since the comments span 24 months, consider stratifying the data by time to ensure that comments from all periods are equally represented. This ensures that the model can generalize across different time periods.

  4. User-Based Sampling: Since comments come from 10K users, you might want to ensure that the 100K comments come from a diverse set of users to avoid bias toward frequent users or specific groups.

Best Practice: A combination of stratified and random sampling might be best to ensure representativeness, especially across users and time periods.
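A minimal sketch of that combined strategy, assuming the comments sit in a pandas DataFrame with hypothetical `month` and `user_id` columns (the names are illustrative):

```python
import pandas as pd

def stratified_sample(comments: pd.DataFrame, n_total: int = 100_000,
                      seed: int = 0) -> pd.DataFrame:
    """Proportional stratified sample over (month, user_id) strata:
    random within each stratum, representative across time and users."""
    frac = n_total / len(comments)
    return (comments
            .groupby(["month", "user_id"])         # strata: time period x user
            .sample(frac=frac, random_state=seed)  # random draw within stratum
            .reset_index(drop=True))

# Hypothetical usage on the 10M-comment pool:
# to_label = stratified_sample(all_comments)  # ~100K rows
```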


Estimating Label Quality from 100K Labeled Comments

[M] Suppose you get back 100K labeled comments from 20 annotators and you want to look at some labels to estimate the quality of the labels. How many labels would you look at? How would you sample them?

To estimate label quality, you should check a subset of the 100K labeled comments. Here's how to decide how many labels to review and how to sample them:

  1. How many labels to check?

    • A common rule of thumb is to inspect around 1-5% of the labeled data. For 100K labeled comments, you could start by reviewing 1,000 to 5,000 labels.
    • You can also use statistical sampling to size the review set. For example, at a 95% confidence level, a margin of error of 5% requires roughly 385 labels and a margin of 2% roughly 2,400, depending on the distribution of labels (see the calculation below).
  2. How to sample them?

    • Random sampling: Select a random subset of labeled comments to get a broad view of the quality across the dataset.
    • Annotator-based sampling: Since you have 20 annotators, it's important to check for annotator bias. Randomly sample labels from each annotator to ensure that no single annotator is consistently inaccurate.
    • Stratified sampling: If the data has certain natural groupings (e.g., by user, topic, or time period), stratify the sampling to ensure that labels from different groups are represented.
    • Disagreement-based sampling: If some comments have been labeled by multiple annotators, focus on comments where there is disagreement between annotators, as these might indicate lower label quality or subjective labeling.

Best Practice: Start with a 1-5% sample of labels and use a mix of random and stratified sampling, with particular attention to annotator consistency.
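The sample sizes quoted in point 1 follow from the standard formula for estimating a proportion, n = z² · p(1 − p) / e², using the conservative worst case p = 0.5 (here p is the unknown label-error rate). A quick check:

```python
import math
from scipy.stats import norm

def sample_size(confidence=0.95, margin_of_error=0.05, p=0.5):
    """n = z^2 * p * (1 - p) / e^2; p = 0.5 maximizes the required n."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value, ~1.96 at 95%
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(sample_size(margin_of_error=0.05))  # 385
print(sample_size(margin_of_error=0.02))  # 2401
```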


Problem with Translation Argument

[M] Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker argues that we should translate more articles into Chinese because translations help with readership. On average, your translated articles have twice as many views as your non-translated articles. What might be wrong with this argument?

The argument may suffer from selection bias. If the site is translating only 1% of articles, it’s possible that the articles selected for translation are already expected to perform better (e.g., high-interest topics, popular writers, breaking news stories). Therefore, the higher view count may be due to the nature of the articles rather than the translation itself.

To address this issue:

  • You need to control for factors like the topic, author, and publication time of the articles.
  • A more robust test would be to randomly select articles for translation and compare the view counts of translated and non-translated articles of similar types.
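As a sketch of that randomized test, assuming hypothetical per-article view counts: with random assignment, a simple two-sample comparison (here Welch's t-test on log views, since view counts are heavy-tailed) isolates the effect of translation.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical experiment: articles randomly assigned to translation vs. control,
# so topic, author, and timing effects average out across the groups.
views_translated = rng.lognormal(mean=7.2, sigma=1.0, size=500)
views_control = rng.lognormal(mean=7.0, sigma=1.0, size=500)

stat, p_value = ttest_ind(np.log(views_translated), np.log(views_control),
                          equal_var=False)  # Welch's t-test
print(f"t = {stat:.2f}, p = {p_value:.4f}")
# Only under random assignment does a significant difference support
# "translation causes more views" rather than "popular articles get translated".
```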

Determining if Two Sets of Samples Come from the Same Distribution

[M] How to determine whether two sets of samples (e.g., train and test splits) come from the same distribution?

To determine if two sets of samples come from the same distribution, you can use statistical tests or visual methods:

  1. Kolmogorov-Smirnov (K-S) Test: A non-parametric test that compares the cumulative distributions of two datasets. It’s useful for testing if two samples come from the same continuous distribution.

  2. Chi-Square Test: If the data is categorical, you can use the chi-square test to compare the observed frequencies of categories in the two samples.

  3. Mann-Whitney U Test: Another non-parametric test that compares whether the distributions of two independent samples are different.

  4. Jensen-Shannon Divergence: A measure of the similarity between two probability distributions. A small value indicates that the distributions are similar.

  5. Visual Methods: You can plot histograms, kernel density estimates (KDEs), or empirical cumulative distribution plots for both the train and test sets to visually inspect whether they come from the same distribution.
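A minimal sketch combining two of the checks above, the K-S test on raw samples and the Jensen-Shannon distance on shared-bin histograms; the synthetic train/test arrays are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=10_000)
test = rng.normal(0.1, 1.0, size=2_000)  # deliberately shifted a little

# 1. Kolmogorov-Smirnov test on the raw samples.
stat, p = ks_2samp(train, test)
print(f"KS statistic = {stat:.3f}, p-value = {p:.4f}")  # small p => different

# 2. Jensen-Shannon distance on histograms over shared bins.
bins = np.histogram_bin_edges(np.concatenate([train, test]), bins=50)
p_hist, _ = np.histogram(train, bins=bins, density=True)
q_hist, _ = np.histogram(test, bins=bins, density=True)
print(f"JS distance = {jensenshannon(p_hist, q_hist):.3f}")  # 0 means identical
```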

[M] How to determine outliers in your data samples? What to do with them?

Methods for detecting outliers:

  1. Z-score or Standard Deviation Method: An outlier can be defined as a data point that is more than a certain number of standard deviations from the mean (e.g., |z| > 3).

  2. IQR (Interquartile Range): Outliers are data points that lie below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles.

  3. Isolation Forest: An unsupervised learning algorithm that isolates outliers by recursively partitioning data. Points that require fewer splits to isolate are considered outliers.

  4. DBSCAN (Density-Based Clustering): A clustering algorithm that identifies dense regions in the data and treats points that don’t belong to any cluster as outliers.

  5. Visual Methods: Plotting data with scatter plots, box plots, or histograms can help visually identify outliers.
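A short sketch of the first three detection methods on a synthetic 1-D sample; the data, thresholds, and contamination rate are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 1_000), [8.0, -9.0, 12.0]])  # planted outliers

# 1. Z-score rule: more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
print("z-score outliers:", x[np.abs(z) > 3])

# 2. IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print("IQR outliers:", x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# 3. Isolation Forest: label -1 marks points that are easy to isolate.
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(x.reshape(-1, 1))
print("IsolationForest outliers:", x[labels == -1])
```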

What to do with outliers:

  • Remove outliers: If they are the result of data entry errors or noise, removing them can improve model performance.
  • Transform outliers: You can apply transformations (e.g., log transformation) to reduce the impact of extreme values.
  • Use robust models: Some models, such as decision trees or median-based regression, are less sensitive to outliers.
  • Imputation: For missing or clearly incorrect values, consider imputing reasonable values based on the rest of the data.
