Introduction to Machine Learning

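  • Supervised Learning: every training sample has a label, and the model learns to predict that label or output value.
  • Unsupervised Learning: the training samples have no labels. The result is defined by the distribution and patterns in the data.
  • Reinforcement Learning: the algorithm receives different rewards for different actions and uses them to move toward the target outcome.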

 

Main Components of Machine Learning

Machine Learning Models

A model is like dough: the same material can be shaped for many different purposes. Concretely, it is a code-level module for solving a problem.

Example: Linear Regression Model

 

Machine Learning Training

Training combines a model with data to produce a trained model. This process is called model training.

 

Model Inference

The state in which a trained model is put to real use and can generate predictions.
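
The three components above can be sketched with the linear regression example. This is a minimal illustration, not a production implementation: the function names `train` and `predict` and the data points are assumptions made up for this note.

```python
from statistics import mean

def train(xs, ys):
    """Model training: fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept  # the trained model is just these two numbers

def predict(trained_model, x):
    """Model inference: apply the trained parameters to a new input."""
    slope, intercept = trained_model
    return slope * x + intercept

# Made-up training data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

trained_model = train(xs, ys)             # model + data -> trained model
prediction = predict(trained_model, 4.0)  # inference on an unseen input
```

The same `train` function would produce a different trained model from different data, which is what makes the untrained model generic.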

 

The Five Steps of Machine Learning

  • Define the problem
  • Build the dataset
  • Train the model
  • Evaluate the model
  • Use the model

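Problem Definition

Decide which problem to solve. Take care to define a problem whose goal and motivation are precise.

  • Poor definition: How can we make business better?
  • Good definition: Predict ice cream sales as the temperature changes by 1 degree.

Which machine learning task should you choose?

  • Supervised Learning: categorical label (classification), continuous label (regression)
  • Unsupervised Learning: clustering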

 

Building the Dataset

  • Data Collection: collect data with SQL queries, scrapers, and similar tools
  • Data Inspection: check whether suitable data was collected (outliers, missing or incomplete values, data transformation)
  • Summary Statistics: check the statistical structure of the data (mean, interquartile range, standard deviation, etc.)
  • Data Visualization: visualize the data to spot trends and outliers
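
The summary statistics step can be sketched with Python's standard `statistics` module. The sales figures are made up for illustration, and the 1.5 × IQR cutoff is a common rule of thumb rather than something these notes prescribe.

```python
import statistics

# Hypothetical daily sales figures; 250 is a deliberately planted outlier.
sales = [12, 15, 14, 10, 13, 16, 11, 250]

mean_value = statistics.mean(sales)
std_dev = statistics.stdev(sales)              # sample standard deviation
q1, q2, q3 = statistics.quantiles(sales, n=4)  # quartile cut points
iqr = q3 - q1                                  # interquartile range

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in sales if v < lower or v > upper]
```

A single extreme value pulls the mean and standard deviation far from the typical values, while the quartiles barely move, which is why the interquartile range is useful during inspection.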

 

Model Training

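  • Splitting Datasets: split the data into a training set and a test set
  • Training Model: update the model iteratively to minimize the loss function
    • Model Parameters: settings that the training algorithm can update to change how the model behaves
    • Loss Function: the distance from the goal, that is, the difference between the actual and predicted values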
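
Splitting a dataset might look like the sketch below. The helper name `split_dataset`, the 80/20 ratio, and the fixed seed are assumptions chosen for illustration.

```python
import random

def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples, then hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for ten training samples
train_set, test_set = split_dataset(data)
```

Shuffling before the split guards against any ordering in the collected data leaking into one of the two sets.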

Model parameters are updated so that the value of the loss function is minimized.

  • Hyperparameters: parameter values that are not modified during model training; they can be set manually
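
The relationship between model parameters, the loss function, and hyperparameters can be sketched with plain gradient descent on a linear model. Gradient descent is one possible training algorithm chosen here as an assumption; the data, `learning_rate`, and `epochs` values are also illustrative.

```python
def mse_loss(w, b, xs, ys):
    """Loss function: mean squared difference between predictions and actual values."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, learning_rate=0.05, epochs=2000):
    """learning_rate and epochs are hyperparameters: fixed before training starts."""
    w, b = 0.0, 0.0  # model parameters: updated on every iteration
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient so the loss decreases.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

initial_loss = mse_loss(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
final_loss = mse_loss(w, b, xs, ys)
```

Changing `learning_rate` does not change what the model can represent, only how quickly or reliably training reaches a low loss.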

 

Model Evaluation

Verify the model's accuracy.

Choose an evaluation metric that fits the use case.

Supervised

  • Precision
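  • Recall
  • Log Loss
  • Mean Absolute Error (MAE): the average of the absolute differences between actual and predicted values
  • Hinge Loss
  • Quantile Loss
  • $ R^2 $: measures how well the model predicts, as the proportion of the total variation in the outcome that the model explains
  • F1 Score
  • KL Divergence
  • Root Mean Square Error (RMSE): similar to MAE, but squaring the errors penalizes predictions that are far from the actual values more heavily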
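
The three regression metrics described in the list above can be computed directly. The values below are made up so that one prediction misses badly, which shows how RMSE reacts more strongly than MAE.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: squaring weights large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """R^2: 1 minus the ratio of residual variation to total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual values and predictions; only the last one is wrong.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 13.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
```

Here the single error of 4 gives an MAE of 1 but an RMSE of 2, because the squared error dominates the average.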

Unsupervised

  • Fowlkes-Mallows
  • V-measure
  • Silhouette coefficient
  • Rand index
  • Mutual information
  • Completeness
  • Contingency Matrix
  • Homogeneity
  • Pair Confusion Matrix

Neural network

  • Accuracy
  • Precision
  • Confusion matrix
  • ROC Curve
  • False negative rate
  • Log Loss
  • F1 Score
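
Several of the classification metrics listed above (accuracy, precision, recall, F1 score, false negative rate) follow from the four cells of a binary confusion matrix. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the label lists are made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical ground truth and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions that are right
precision = tp / (tp + fp)                  # of predicted positives, how many are real
recall = tp / (tp + fn)                     # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
false_negative_rate = fn / (fn + tp)        # equals 1 - recall
```

Which of these to prioritize depends on the use case: a screening task may care most about the false negative rate, while a spam filter may care more about precision.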

 

Using the Model

 

Glossary

Bag of words: A technique used to extract features from the text. It counts how many times a word appears in a document (corpus), and then transforms that information into a dataset.
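
A minimal bag-of-words sketch, assuming whitespace tokenization and lowercasing (real tools usually do more, such as stripping punctuation). The helper name `bag_of_words` and the two sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
```

Each row of `vectors` lines up with `vocabulary`, so the text has become a numeric dataset; word order is discarded in the process.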

A categorical label has a discrete set of possible values, such as "is a cat" and "is not a cat."

Clustering. Unsupervised learning task that helps to determine if there are any naturally occurring groupings in the data.

CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.

A continuous (regression) label does not have a discrete set of possible values, which means possibly an unlimited number of possibilities.

Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model.

Discrete: A term taken from statistics referring to an outcome taking on only a finite number of values (such as days of the week).

FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.

Hyperparameters are settings on the model which are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.

Log loss is used to calculate how uncertain your model is about the predictions it is generating.
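
As a rough sketch, binary log loss can be computed as below. The clipping constant `eps` and the example probabilities are assumptions for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Both models predict the right class, but with different certainty.
confident = log_loss([1, 0], [0.9, 0.1])
unsure = log_loss([1, 0], [0.6, 0.4])
```

Both sets of predictions would score the same on accuracy, but the less certain one receives a higher log loss.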

Hyperplane: A mathematical term for a flat subspace with one dimension fewer than the space that contains it, such as a line in a 2-D plane or a plane in 3-D space.

Impute is a common term referring to different statistical tools which can be used to calculate missing values from your dataset.
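
Mean imputation is one of the simplest such tools. The sketch below assumes missing values are represented as `None`; the helper name `impute_mean` and the readings are made up.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical temperature readings with two gaps.
temps = [20.0, None, 24.0, 22.0, None]
filled = impute_mean(temps)
```

Mean imputation keeps the dataset's average unchanged but shrinks its variance, so other strategies (median, model-based) may fit some datasets better.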

A label refers to data that already contains the solution.

A loss function is used to codify the model's distance from its goal.

Machine learning, or ML, is a modern software development technique that enables computers to solve problems by using examples of real-world data.

Model accuracy is the fraction of predictions a model gets right.

Continuous: Floating-point values with an infinite range of possible values. The opposite of categorical or discrete values, which take on a limited number of possible values.

Model inference is when the trained model is used to generate predictions.

A model is an extremely generic program, made specific by the data used to train it.

Model parameters are settings or configurations the training algorithm can update to change how the model behaves.

Model training algorithms work through an iterative process where the current model iteration is analyzed to determine what changes can be made to get closer to the goal. Those changes are made and the iteration continues until the model is evaluated to meet the goals.

Neural networks: a collection of very simple models connected together. These simple models are called neurons. The connections between these models are trainable model parameters called weights.

Outliers are data points that are significantly different from others in the same sample.

Plane: A mathematical term for a flat surface (like a piece of paper) on which two points can be joined by a straight line.

Regression: A common task in supervised machine learning, in which the model predicts a continuous value.

In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward (in the form of a number) on the way to reaching a specific goal.

RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.

Silhouette coefficient: A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A score near 1 indicates dense, well-separated clusters.
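
A small sketch of the silhouette computation for one-dimensional points, using absolute difference as the distance. It assumes at least two clusters and distinct points; scoring single-point clusters as 0 is a common convention adopted here, and the points and labels are made up.

```python
def silhouette_score(points, labels):
    """Mean silhouette value over all points (1-D, absolute-difference distance)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other points in the same cluster.
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:
            scores.append(0.0)  # convention for a single-point cluster
            continue
        a = sum(same) / len(same)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) - {label}
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1.0, 2.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])  # nearby points share a cluster
bad = silhouette_score(points, [0, 1, 0, 1])   # nearby points are split apart
```

The sensible grouping scores close to 1, while the scrambled assignment scores below zero, matching the interpretation given in the definition.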

Stop words: A list of words removed by natural language processing tools when building your dataset. There is no single universal list of stop words used by all natural language processing tools.

In supervised learning, every training sample from the dataset has a corresponding label or output value associated with it. As a result, the algorithm learns to predict labels or output values.

Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.

Training dataset: The data on which the model will be trained. Most of your data will be here.

Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.

In unlabeled data, you don't need to provide the model with any kind of label or solution while the model is being trained.

In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.

 

 
