Skip to main content

Applying Machine Learning in Ad Tech

Introduction

In recent years, programmatic advertising is been taking over the online advertisement industry. Many companies referred to as DSP (Demand Side Platform) compete for the same ad-slot on the internet. The success of the DSPs to deliver the values to the advertisers is evaluated on the below two criteria:

  1. High Click Through Rate = Number of clicks/ Number of ads shown
  2. High Conversion Rate = Number of conversions (such as a purchase)/ Number of ads shown

To achieve high CTR and Conversion Rates, DSPs heavily rely on using Artificial Intelligence techniques and develop their in-house Machine Learning algorithms. The problem of applying Machine Learning in Adtech is different from the standard problem in many sense. Google's paper Ad Click Prediction: a view from the Trenches and Facebook's paper Practical Lessons from Predicting Clicks on Ads at Facebbok have discussed in details the lessons learned while building an AI for the adtech industry. In the remaining blog post, I will try to summarize the intricacies of applying ML in adtech and how it is tackled in general:

  • Large size of the training vector: Every feature in the ML model is a categorical feature and encoding them into numerical feature explodes the size of the training vector to the order of billions. For example, one of the most important features in the ML model is the publisher website where the ad would be displayed which is a categorical feature and there are millions of publishers so using one-hot encoding would result in a training vector of million entries.

  • Skewness of the training data: Typically, CTRs are much lower than 50% (generally, CTR is around 1-2%)which means that positive examples (clicks) are relatively rare, hence is the problem of skewness in the training data is introduced.

  • Rapid changes in the online ad landscape: The adtech domain is a very dynamic environment where the data distribution changes over time. Facebook conducted an experiment where they trained the model on one day of data and evaluated on the six consecutive days. The results showed that the performance of the model decreases as the delay between the training and test set increases. Thus, it is very important to update the model every few hours to keep it very real-time

  • Per-coordinate learning rate: In most of the standard ML problems, the learning rate is a constant. In adtech, there is a huge imbalance of the number of training instances on each feature. For example, a famous publisher such as cnn.com will be having more users thus more ad-slots compared to a very unknown one thus, our training data will have a huge number of training instances for cnn.com. Therefore, we want to decrease the learning rate for the coordinate as its frequency in the training data increases.

  • Using Progressive Validation rather than cross-validation: Validating the model on the data set which lags the train set by hours or days is not a good idea as we discussed above that the nature of the dataset changes with time. The online log loss is instead a good proxy for the model performance because it measures the performance only on the most recent data before we train on it, this is exactly analogous to what happens when the model is in production. This also ensures that we can use 100% of our data for both training and testing.

  • Relative changes in the metric compared to Absolute metric: Click rates vary from country to country and from ad-slot to ad-slot and therefore the metrics change over the course of a single day. Google experiments indicate that the relative changes(compared to a baseline model) are much more stable over time, therefore, a relative change in logloss is a better metric than the average log loss. We also take care only to compare metrics computed from exactly the same data.

  • Segmented performance metrics instead of aggregated metrics: One of the things that we have to take care while analyzing the performance of the models in adtech is that the aggregated performance metrics may hide effects that are specific to certain sub-populations of the data. For example, a high CTR may, in fact, be caused by a mix of low and high CTR from different ad exchanges. This makes it critical to visualize the performance metrics not only on the aggregate data but also on the various slicing of the data such as per ad exchange, per adgroup, per device type, per publisher.

This blog is the aggregation of all the lessons learnt while working as Data Scientist in the AdTech company. Do let me know if you have any additional comments.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Building Machine Learning Data Pipeline using Apache Spark

Introduction

Apache Spark is increasingly becoming popular in the field of Data Sciences because of its ability to deal with the huge datasets and the capability to run computations in memory which is particularly useful in iterative tasks such as the training step of the Machine Learning algorithm. As part of the Data Engine team at Sprinklr, I had some experience building the data processing pipeline in Spark. In this blog post, I will try to summarize my learning in simpler, easy to understand terms along with the python code.

Q. Why is Apache Spark a suitable tool for building the ML data pipeline?

Ans. Few years ago, scikit-learn came up with the idea of data pipeline but with the advent of big data, it became very problematic to scale. Spark's data pipeline concept is mostly inspired by the scikit-learn project. It provides the APIs for machine learning algorithms which make it easier to combine multiple algorithms into a single pipeline, or workflow.

Now, I will introduce the key concepts used in the Pipeline API:

DataFrame: It is basically a data structure for storing the data in-memory in a highly efficient way. Dataframe in Spark is conceptually equivalent to a dataframe in R/Python. It can store different data types such a string, vectors, true labels, and predictions. Dataframes can be created from the csv, json and many different file formats stored on the local filesystem, Hadoop HDFS or cloud environment such as AWS S3.

Transformer: It is a method or an algorithm which can transform one DataFrame into another DataFrame. It includes SQL statements, feature transformers and learned ML models. While defining a transformer, you have to specify the column it would operate on and the output column it would append to the input DataFrame. Technically, a Transformer implements a method transform(). E.g., a SQL select statement which would return a new dataframe with only required columns. Another example is a trained ML model which turns a dataframe with feature vectors into a dataframe with predictions.

Estimator: It is an algorithm which can be fit on a DataFrame to produce a Transformer. Technically, an Estimator implements a method fit() which accepts a DataFrame and produces a Model which is a Transformer. For example, a machine learning algorithm is an Estimator which trains on a DataFrame and produces a trained model which is a transformer as it can transform a feature vector into predictions.

Parameter: These are the hyperparameters used during cross-validation phase of the ML pipeline.

Pipeline: A Pipeline is a sequence of PipelineStage (Transformers and Estimators)together to be running in a particular order to specify a Machine Learning workflow. A Pipeline’s stages are specified as an ordered array. For example predicting the price of a house given it's breadth, length, location and age involve several stages:

  • Remove the data points which have all columns as null value - Transformer
  • Create a new feature area - Transformer
  • Learn a prediction model using the feature vectors and the actual price - Estimator
  • Use the learned model to predict the prices - Transformer

A Pipeline is an Estimator in itself. Thus, after a Pipeline’s fit() method is called, it produces a PipelineModel, which is a Transformer.

PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. This PipelineModel is used at test time.

Let's dive into the Python code using an example mentioned in the Spark's doc here where we are trying to classify a line of text into 0 or 1:

Step 1: Create a DataFrame

# Some import statements
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# DataFrame could be created from a csv file or any other sources 
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Let's also create a test dataset which we will use later
test = spark.createDataFrame([
(4, "spark i j k"),
(5, "l m n"),
(6, "spark hadoop spark"),
(7, "apache hadoop")
], ["id", "text"])

Step 2: Specify the transformers and the estimators of the pipeline

# Tokenizer is a transformer which would convert the text column into words using space as a delimeter
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# HashingTF is again a transformer which takes the column "words as an input and creates a new column of a vector" 
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

# Logistic Regression is an Estimator which would take "features" and "label" as an input and create a trained model
lr = LogisticRegression(maxIter=10, regParam=0.001)

Step 3: Create the pipeline using the transformers and the estimators defined in step 2

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

Step 4: Call the fit() method on the pipeline to create a PipelineModel

model = pipeline.fit(training)

Step 5: Use the PipelineModel to do the predictions of the test dataset

prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
selected.show()

One of the big benefits of the Machine Learning Data Pipeline in Spark is hyperparameter optimization which I would try to explain in the next blog post. I hope this blog would help you in getting started with Spark for building ML data pipelines.

Click Through Rate Analysis using Spark

Introduction

In recent years, programmatic advertising is been taking over the online advertisement industry. To enable automatic selling and purchasing ad impressions between advertisers and publishers through real-time auctions, Real-Time Bidding (RTB) is quickly becoming the leading method.

In contrast to the traditional online ad market, where a certain amount of impressions is sold at a fixed rate, RTB allows advertisers to bid each impression individually in real time at a cost based on impression-level features. Real-time Bidding (RTB) is a way of transacting media that allows an individual ad impression to be put up for bid in real-time. This is done through a programmatic on-the-spot auction, which is similar to how financial markets operate. RTB allows for Addressable Advertising; the ability to serve ads to consumers directly based on their demographic, psychographic, or behavioral attributes.

Many DSPs (Demand Side Platforms) act as agents for the advertisers and take part in the real-time auction on behalf of them. In order to enable real-time bidding and provide the advertisers with the clicks at the lowest price possible, DSPs develop their own machine learning algorithms using techniques such as hashing trick, feature combinations, stochastic gradient descent etc.

Motivation

Like the standard practice in most of the data science use cases, whenever a new algorithm is developed, they are put into A/B test against the already existing algorithm in the production (at least for few days) in order to do determine which algorithm suits the business metrics better.

Due to the huge volume of bid requests (around a million bid requests per second), the amount of data collected during AB is in the order of 100 GBs. Python's Pandas library has basically all the functionality needed to do the offline analysis of the data collected in terms of CPCs, spend, clicks, CTR, AUC etc.

But, Pandas has a huge problem, it has to load all the dataset in memory in order to run some computations on it. From my experience, Pandas needs the RAM size to be 3 times the size of the dataset and it can not be run into a distributed environment as cluster a of machines. This is where Apache Spark is useful as it can process the datasets whose size is more than the size of the RAM. This blog will not cover the internals of Apache Spark and how it works rather I will jump to how the Pandas CTR Analysis code can be easily converted into spark analysis with few syntax changes.

Migrating to Spark from Pandas

In new versions, Spark started to support Dataframes which is conceptually equivalent to a dataframe in R/Python. Dataframe support in Spark has made it comparatively easy for users to switch to Spark from Pandas using a very similar syntax. In this section, I would jump to coding and show how the CTR analysis that is done in Pandas can be migrated to Spark.

Before I jump into the coding, I would like to introduce some of the keywords used in the code:

Effective CPC: Total money spent / Total number of clicks

Label: It is either 0 or 1 (1 signifies that the click happened and 0 is for no click)

Win Price: The price paid to win the on-spot auction

Bid CPC: The price the advertiser is willing to pay for the impression

CTR: Click Through Rate = Total Number of Clicks / Total Number of Impressions

How is Win Price different from Bid Price?

If an exchange is using First Price Auction, the win pice and the bid price is same but if the exchange is using Second Price Auction, the advertizer with the highest bid price wins but it pays the price equivalent to the second highest bid price hence the win price is less than the bid price.

Setting up notebook and importing libraries

Pandas
import pandas as pd
Spark
import os
import sys
os.environ['SPARK_HOME'] = "/home/spark-2.3.2-bin-hadoop2.7"
os.environ['JAVA_HOME'] = "/home/jdk1.8.0_181"
spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))

from  pyspark import SparkContext
from pyspark.conf import SparkConf
CLUSTER_URL = "spark://address:7077"
conf = SparkConf()
conf.setMaster(CLUSTER_URL).setAppName("CTR Analysis").set("spark.executor.memory", "120g")
sc = SparkContext(conf=conf)
print(sc.version)

Reading CSV File

Pandas
df = pd.read_csv(data_file_path, names=cols, error_bad_lines=False, warn_bad_lines=True, sep=',')
Spark
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType
file_location = "hdfs://address:port/hadoop/dataNode/pyspark/data.csv"
df = spark.read.csv(file_location, header=False, schema=schema, sep=",")
# df = spark.read.csv(file_location, header=False, inferSchema=True, sep=",")
# df.cache()

Cleaning data

Pandas
numeric_cols = ['label','win_price','ctr','bid_cpc']
df[numeric_cols] = df[numeric_cols].convert_objects(convert_numeric=True)
df.dropna(inplace=True, subset=numeric_cols)
Spark
numeric_cols = ['label','win_price','ctr','bid_cpc']
df = df.dropna(subset=numeric_cols)

Calculating Spend, CTR, CPC Per Algo

Pandas
data = df.groupby(['algo']).agg({'win_price': np.sum, c_label:{'clicks':np.sum, 'wins':'count'}}).reset_index()
data[('win_price','sum')] = data[('win_price','sum')] / 1000.
data['ecpc'] = data[('win_price','sum')] / data[('label,'clicks')]
data = pd.DataFrame(data.to_records())
data.columns = ['',algo, 'spend', 'number of impressions', 'number of clicks', 'effective cpc']
Spark
from pyspark.sql.functions import udf
import pyspark.sql.functions as f

def divide_by_1000(x):
    return x/1000.0

udfdivide_by_1000 = udf(divide_by_1000, DoubleType())

data_wins = df.groupby(['algo']).agg( {'win_price': 'sum', 'label: 'count'})
data_clicks = df.groupby(['algo']).agg({'label: 'sum'})
# print data_wins.columns
# print data_clicks.columns
# Rename the columns
data_wins = data_wins.withColumnRenamed("sum(win_price)", "win_price").withColumnRenamed("count(label)", "wins")
#     print data_wins.schema
#     data_wins['win_price'] = data_wins.win_price/1000.
data_wins = (data_wins.withColumn("win_price",udfdivide_by_1000("win_price")))
data_clicks = data_clicks.withColumnRenamed("sum(label)", "clicks")
#     print data_wins.columns
# print data_clicks.columns
data = data_wins.join(data_clicks, on = 'algo', how='inner')
data = (data.withColumn("effective cpc", f.col("win_price")/100))

Google Summer of Code 2017

Proposal            Poster            Github

Google Summer of Code 2016 under the Julia Language was an experience of a lifetime which motivated me to apply to this year's program as well. While, my last year project with Convex.jl was hardcore mathematics, this time I wanted to work in the field of Machine Learning (mainly it's application in the field where there was a potential but not much had been explored). I was well aware of the applications of Machine Learning techniques in Computer Vision, Natural Language Processing, Stock Market Predictions but I never heard or read anyone saying that they used Machine Learning in Differential Equations. So, as soon as I read about the project Machine learning tools for classification of qualitative traits of differential equation solutions, I knew what I had to do in the coming summers. It's been awesome 4 months working on my project and I am so thankful to Google Summer of Code 2017 program and The Julia Language for providing me such an incredible opportunity. The thanksgiving note would be incomplete without the mention of my mentor Chris Rackauckas who always helped me whenever I felt like giving up.

In the remaining blog, I will touch upon the technical aspects of my project and my work:

Project Abstract

Differential equation models are widely used in many scientific fields that include engineering, physics and biomedical sciences. The so-called “forward problem” that is the problem of solving differential equations for given parameter values in the differential equation models has been extensively studied by mathematicians, physicists, and engineers. However, the “inverse problem”, the problem of parameter estimation based on the measurements of output variables, has not been well explored using modern optimization and statistical methods. Parameter estimation aims to find the unknown parameters of the model which give the best fit to a set of experimental data. In this way, parameters which cannot be measured directly will be determined in order to ensure the best fit of the model with the experimental results. This will be done by globally minimizing an objective function which measures the quality of the fit. This inverse problem usually considers a cost function to be optimized (such as maximum likelihood). This problem has applications in systems biology, HIV-AIDS study.

The expected outcome of my project was set of a tools for easily classifying parameters using machine learning tooling for users inexperienced with machine learning.

Application - HIV-AIDS Viral Dynamics Study

Studies of HIV dynamics in AIDS research are very important for understanding the pathogenesis of HIV infection and for assessing the potency of antiviral therapies. Ordinary differential equation (ODE) models are proposed to describe the interactions between HIV virus and immune cellular response.

One popular HIV dynamic model can be written as:

HIV-AIDS Dynamics Differential Equation

where (T) is target cells which are assumed to be produced at a constant rate s and which are assumed to die at rate d per cell. Productive infection by virus (V), occurs by virus interacting with target cells at a rate proportional to the product of their densities i.e at rate βVT, where β is called the infection rate constant. Productively infected cells (I) are assumed to die at rate δ per cell. Virus is produced from productively infected cells at rate p per cell and is assumed to either infect new cells or be cleared. In the basic model, loss of virus by cell infection is included in the clearance process and virus is assumed to be cleared by all mechanisms at rate c per virion.

Here we assume that the parameters p and c are known and can be obtained from the literature.

Very similar to any Machine Lerning problem, we approached the problem of parameter estimation of differential equations by 2 ways: the Bayesian approach (where the idea is to find the probability distribution of the parameters) and the optimization approach (where we are interested to know the point estimates of the parameters).

Optimization Approach

My mentor Chris had integrated the LossFunctions.jl which builds L2Loss objective function to estimate the parameters of the differential equations. I started by implementing the Two-stage method which is based on based on the local smoothing approach and a pseudo-least squares (PsLS) principle under a framework of measurement error in regression models.

The following is the comparative study of the above two methods:

Advantages

  1. Computational efficiency
  2. The initial values of the state variables of the differential equations are not required
  3. Providing good initial estimates of the unknown parameters for other computationally-intensive methods to further refine the estimates rapidly
  4. Easing of the convergence problem

Disadvantages

  1. This method does not converge to the global/local minima as the Non-Linear regression does if the cost function is convex/concave.

I also wrote implementations and test cases for supporting different optimizers and algorithms such as: 1. Genetic Algorithm from Evolutionary.jl 2. Stochastic Algorithms from BlackBoxOptim.jl 3. Simulated Annealing, Brent method, Golden Section Search, BFGS algorithm from Optim.jl 4. MathProgBase associated solvers such as IPOPT, NLopt, MOSEK, etc.

Then, I integrated the PenaltyFunctions.jl with DiffEqParamEstim to add regularization to the loss function such as Tikhonov and Lasso regularization to surmount ill-posedness and ill-conditioning of the parameter estimation problem.

Bayesian Approach

Our objective was to translate the ODE described in DifferentialEquations.jl using ParameterizedFunctions.jl into the corresponding Stan (a Markov Chain Monte Carlo Bayesian inference engine) code and use Stan.jl to find the probability distribution of the parameters without writing a single line of Stan code.

The exhaustive list of pull requests can be found here.

I am glad to have been successfully implemented what I proposed.

References

[1].Parameter Estimation for Differential Equation Models Using a Framework of Measurement Error in Regression Models

[2].Parameter estimation: the build_loss_objective #5

[3].Robust and efficient parameter estimation in dynamic models of biological systems

[4].Parameter Estimation for Differential Equations: A Generalized Smoothing Approach

[5].Stan: A probabilistic programming language for Bayesian inference and optimization

[6]. Linking with Stan project #135

One-Class Classification Algorithms

In my previous blog entry, I have put forth a case for when the traditional classification algorithms do not perform well i.e. when the training data has reasonable level of imbalance between the various classes. The problem statement that I was trying to solve had a similar problem of the skewed distribution of the training data. The power company had only created the database of fraudulent customers.

As I mentioned in my previous blog post, there are 2 methods to tackle the case when we have data from only one class:

  1. The first method is an obvious approach to generate an artificial second class and proceed with traditional classification algorithms.

  2. The second one is to modify the existing classification algoriths to learn on the data from only one class. These algorithms are called "one-class classfication algorithms" and include one-class SVM, One-Class K-Means, One-Class K-Nearest Neighbor, and One-Class Gaussian.

In this blog entry, I will elaborate on the second method and explain the mathematics behind the one-class classification algorithms and how it improves over the traditional classification algorithms.

Fundamental difference between Binary and One Class Classification Algorithms

Binary classification algorithms are discriminatory in nature, since they learn to discriminate between classes using all data classes to create a hyperplane(as seen in the fig. below) and use the hyperplane to label a new sample. In case of imbalance between the classes, the discriminatory methods can not be used to their full potential, since by their very nature, they rely on data from all classes to build the hyperplane that separate the various classes.

Binary Classification HyperplaneBinary Classification Hyperplane

Where as the one-class algorithms are based on recognition since their aim is to recognize data from a particular class, and reject data from all other classes. This is accomplished by creating a boundary that encompasses all the data belonging to the target class within itself, so when a new sample arrives the algorithm only has to check whether it lies within the boundary or outside and accordingly classify the sample as belonging to the target class or the outlier.

One-Class Classification BoundaryOne-Class Classification Boundary

Mathematics behind different One-Class Classification Algorithms

In this section, I will explain the mathematics behind different one-class machine learning algorithms by taking a cue from this research paper.

One-Class Gaussian

This is basically a density estimation model. It assumes that the training data are the samples from the Multivariate Normal Population, therefore for a test sample (say z) having n-feaures, the probability of it belonging to the target class can be calculated as follows:

one-class gaussian

where the parameters μ and Σ are the poputation mean and covariance. Hence, the objective function of this machine learning algorithm is to determine the estimates for μ and Σ. Using the method of maximum likelihood estimator, we can show that the sample mean and sample covariance are the unbiased and consistent estimators for population mean and variance respectively. Hence, once we calculate the probability p(z), we can set a threshold to determine whether a new sample is outlier or not.

One-Class Kmeans

In this method, we first classify the training data into k clusters (which is chosen as per our requirements). Then, for a new sample (say z), the distance d(z) is calculated as the minimum distance of the sample from the centroid of all the k clusters. Now if d(z) is less than a particular thereshold (which is again chosen as per our requirements), then the sample belongs to the target class otherwise it is classified as the outlier.

One-Class K-Nearest Neighbor

Let us take note of some notation before we understand the mathematics behind the algorithm.

d(z,y) : distance between two samples z and y

NN(y) : Nearest Neighbor of sample y

Now given a test sample z, we find the nearest neighbor of z from the training data (which is NN(z) = y) and the nearest neighbor of y (which is NN(y)). Now the rule is to classify z as belonging to the target class when:

K-Nearest Neighbor

where the default value of δ is 1 but can be chosen to satisfy our requirements.

In my next blog entry, I will try to explain the most widely used one-class classification algorithm i.e. one-class support vector machines. Stay tuned !

Thank you very much for making it this far.

When does traditional classification algorithm fail?

One of the most intriguing Machine Learning problems that I have come across was during my 3rd year in the college where a startup named Quantta Analytics presented us with the problem statement about a power distribution company which had observed a significant loss of revenue over a period of time. The loss in revenue was mainly becuase of possible power theft by malicious consumers. The power company resorted to periodic vigilance of customers by sampling and creating a database of only the fraudulent customers to curb this practice. The company wanted the vigilance process to be more robust and effective. Hence, the objective of the problem statement was to deploy a machine learning algorithm to provide a list of customers who are likely to commit fraud. So the problem was effectively a classification problem (with a catch!) where given the attributes of a customer, we had to predict where it is likely to commit the electricity theft or not.

In order to appreciate the problem statement, let us first have the review of the well-known famous classification algorithms. Classification problems try to solve the two or multi-class situation. The goal is to distinguish test data between a number of classes, using training data which has samples from all the possible classes and training data has reasonable level of balance between the various classes. Now the question that arise is - At what levels of imbalance does the use of traditional classifiers becomes futile?

This paper has made observation by conducting experiments on various datasets from the UCI repository, and monitoring the performance of the traditional classifiers. I am listing down the most important observations stated in the paper:

  • The performance of the traditional classifiers start to decline when the imbalance between output classes increase and the decline becomes prominent at the ratio 1:2.8.

  • At the ratio 1:10, the performance is so poor for traditional classifiers that it can no longer be trusted.

Figure below shows the initial ratio of the classes present in UCI dataset and the ratio at which the performance of the binary classifiers starts to deteriorate.

Binary Classification Algorithm Performance Table

Now the questions that arise are - what if we have the training data which has imbalance between the classes or data only from one class? Why do we need to study such case? And are there any situations where such data is available?

To answer the latter first, yes there are plenty of situations where we have the data from only one class. Consider a case of a nuclear power plant. We have the measurement of the plant conditions such as temperature, reaction rate when the plant is in working condition. Is it possible to get such measurements in case of an accident? No. Now, if we want to predict the possible scenario of the breakdown of the plant. From the above observations it is sure that the traditional classification algorithms will not perform well. Some other cases are:

  • Detection of oil spill

  • In computational biology to predict microRNA gene target

There are 2 methods to tackle the case when we have data from only one class:

  1. The first method is an obvious approach to generate an artificial second class and proceed with traditional classification algorithms.

  2. The second one is to modify the existing classification algoriths to learn on the data from only one class. These algorithms are called "one-class classfication algorithms" and include one-class SVM, One-Class K-Means, One-Class K-Nearest Neighbor, and One-Class Gaussian.

I will be elaborating on the above two methods in my next blog post. Thank you very much for making it this far.

Is it the end or another beginning ?

Proposal            Blog            Talk            Presentation            Github

It hasn't been long when I was hugging my friends, calling my parents to inform them of my selection to Google Summer of Code, 2016 program and now here I am writing a wrap-up post for the program. I remember wanting to spend my summers working on the project that involved lots of mathematics, some of the computer science and travel as part of the project. I am so thankful to Google Summer of Code, 2016 program and The Julia Language to have made my each and every wish come true. The thanksgiving note would be incomplete without the mention of my mentors Madeleine and Dj who always helped me whenever I felt like giving up.

Now, coming to the technical aspects of my project, it was divide in 3 phases based on the branches of convex programming namely:

1. Support for complex-domain linear programs

During this phase of the project, I extended the present implementation in Convex.jl to provide support for linear programs involving complex variables and the complex coefficients. The technical details of the implementation are described in the blog post here.

2. Support for second order conic programs

In this phase of the project, I in consulation with my mentors, rewrote many the second order cone atoms such as abs, norm to accept complex arguments. We also had an intense discussion on whether redefining the above atoms to accept complex arguments would violate DCP compliance and we came to a conclusion that defining inverse or the square atoms on complex variables neither makes sense nor does it preserve the DCP compliance.

3. Support for Complex Semidefinite programs

The above 2 phases were relatively difficult for us as we had no literature references and the decision we made were solely based on our understanding and intuition. During this phase, we used the mapping in the introductory section of the research paper here to transform a complex semidefinite program to the corresponding real semidefinite program. Presently, I am writing test cases to check the correctness of our implementation.

The exhaustive list of commits can be found here.

I am glad to have been successfully implemented what I proposed. Presently, I am also writing the documentation and examples to demonstrate the usability of my implementation. The project will culminate with a single pull request to the Convex.jl repository as well as the release of a new version of Convex.jl which we plan to do in next few days.

References

[1].Approximation algorithms for MAX-3-CUT and other problems via complex semidefinite programming

[2].Invariant semidefinite programs

[3].Convex Optimization in Julia

[4].Support for complex variables

[5].Add complex variables

Announcing support for complex-domain linear Programs in Convex.jl

I am pleased to announce the support for complex-domain linear Programs (LPs) in Convex.jl. As one of the Google Summer of Code students under The Julia Language, I had proposed to implement the support for complex semidefinite programming. In the first phase of project, I started by tackling the problem of complex-domain LPs where in first subphase, I had announced the support for complex coefficients during JuliaCon'16 and now I take this opportunity to announce the support for complex variables in LPs.

Complex-domain LPs consist of a real linear objective function, real linear inequality constraints, and real and complex linear equality constraints.

In order to enable complex-domain LPs, we came up with these ideas:

  1. We redefined the conic_form! of every affine atom to accept complex arguments.
  2. Every complex variable z was internally represented as z = z1 + i*z2, where z1 and z2 are real.
  3. We introduced two new affine atoms real and imag which return the real and the imaginary parts of the complex variable respectively.
  4. transpose and ctranspose perform differently on complex variables so a new atom CTransposeAtom was created.
  5. A complex-equality constraint RHS = LHS can be decomposed into two corresponding real equalities constraint real(RHS) = real(LHS) and imag(RHS) = imag(LHS)

After above changes were made to the codebase, we wrote two use cases to demonstrate the usability and the correctness of our idea which I am presenting below:

# Importing Packages
Pkg.clone("https://github.com/Ayush-iitkgp/Convex.jl/tree/gsoc2")
using Convex

# Complex LP with real variable"
n = 10 # variable dimension (parameter)
m = 5 # number of constraints (parameter)
xo = rand(n)
A = randn(m,n) + im*randn(m,n)
b = A * xo 
# Declare a real variable
x = Variable(n)
p1 = minimize(sum(x), A*x == b, x>=0) 
# Notice A*x==b is complex equality constraint 
solve!(p1)
x1 = x.value

# Let's now solve by decomposing complex equality constraint into the corresponding real and imaginary part.
p2 = minimize(sum(x), real(A)*x == real(b),      imag(A)*x==imag(b), x>=0)
solve!(p2)
x2 = x.value
x1==x2 # should return true


# Let's now take an example where the complex variable is used
# Complex LP with complex variable
n = 10 # variable dimension (parameter)
m = 5 # number of constraints (parameter)
xo = rand(n)+im*rand(n)
A = randn(m,n) + im*randn(m,n)
b = A * xo

# Declare a complex variable
x = ComplexVariable(n)
p1 = minimize(real(sum(x)), A*x == b, real(x)>=0, imag(x)>=0)
solve!(p1)
x1 = x.value

xr = Variable(n)
xi = Variable(n)
p2 = minimize(sum(xr), real(A)*xr-imag(A)*xi == real(b), imag(A)*xr+real(A)*xi == imag(b), xr>=0, xi>=0)
solve!(p2)
x1== xr.value + im*xi.value # should return true

List of all the affine atoms are as follows:

  1. addition, substraction, multiplication, division
  2. indexing and slicing
  3. k-th diagonal of a matrix
  4. construct diagonal matrix
  5. transpose and ctranspose
  6. stacking
  7. sum
  8. trace
  9. conv
  10. real and imag

Now, I am working towards implementing the complex-domain Second Order Conic Programming. Meanwhile, I invite the Julia community to play around with the complex-domain LPs. The link to the development branch is here.

Looking forward to your suggestions!

Special thanks to my mentors Madeleine and Dj!

About Me

Mathematics itself is too enthralling to resist. More so, when coupled with insights from Computer Science, the satisfaction of learning is remarkable.

I am a Deep Learning Engineer at EventRegistry where I am developing data science pipeline using deep learning based text classification algorithms and deploying it using rest API. I am applying for a Data Science position at Toptal.

Previously, I worked as a Data Scientist at Zemanta, an Outbrain Company based in Ljubljana, Slovenia. My day to day work involves implementing and analyzing the real-time machine algorithms and integrating it with the bidder infrastructure.

Before joining Zemanta, I worked as the Product Engineer, Data Engine at Sprinklr, India where I was involved in implementing the machine learning module in the data flow execution pipeline using Kafka, Spark, and Spring framework.

I graduated from the Indian Institute of Technology(IIT), Kharagpur with majors in Mathematics and Computer Science and a micro-specialization in Optimization Theory and Applications. My interests lie in the intersection of Applied Mathematics and Computer Science. More specifically, I am intrigued by the current trend in the field of Optimization, Data Analytics and Machine Learning and their application in improving our day to day lives.

I have also been involved in the project titled "Machine learning tools for classification of qualitative traits of differential equation solutions" where I developed set of a tools for easily classifying parameters using machine learning tooling for users inexperienced with machine learning under the mentorship of Chris Rackauckas.

As Google Summer of Code 2016 student, I implemented the support for the Complex SemiDefinite Programming in Convex.jl under the guidance of Madeleine Udell and Dvijotham Krishnamurthy. I have also been part of the project involving the application of optimization techniques in field of portfolio optimization.

While trying to keep up with the rapid advancements in technology, I like to spend my free time traveling, hiking, running and playing cricket.

Contact

Gmail: ayushpandey.iitkgp@gmail.com
LinkedIn:  Ayush Pandey
GitHub:  Ayush-iitkgp

Updates

May, 2018: This is a giant leap forward in life, I am moving to a relatively unknown country Slovenia to work as the Data Scientist at Zemanta, an Outbrain Company. I am grateful for this opportunity and look forward to living an expat life.

Jul, 2017: The new phase of life has begun, I am joining Sprinklr as a Product Engineer in their R&D department. I will be starting to work with the Data Engine team where we will be implementing the machine learning module in the data flow execution pipeline.

Jun, 2017: It feel honoured to present my work related to the Applications of Convex.jl in Optimization Involving Complex Numbers at JuliaCon'17 held at University of California, Berkeley. Here is the link to the talk I delivered .

Jun, 2017: The unbelievable journey has finally concluded. I am graduating from the Indian Institute of Technology Kharagpur with majors in Mathematics and Computing Sciences and a micro-specialization in Optimization Theory and Applications.

May, 2017: I will be working with Julia Language to add Machine Learning functionality in DiffEqParamEstim.jl as Google Summer of Code 2017 student. Here is my proposal.

Apr, 2017: I am honoured to receive Institute Blue for my contribution to the game of cricket in IIT Kharagpur at the Annual Gymkhana Awards ceremony.

Aug, 2016: I would be working as Teaching Assitant for the NPTEL online course Fundamental Algorithms: Design and Analysis supervised by Prof. Sourav Mukhopadhyay.

Jul, 2016: Announcing support for complex-domain Linear Programming in Convex.jl.

Jun, 2016: It was great experience meeting with Julia developers at JuliaCon2016 at MIT, Cambridge. Here is the talk I delivered to introduce the community of my work.

May, 2016: Joining Samsung Research Institute, Bangalore as Student Trainee.

Apr, 2016: It gives me immense pleasure to announce that I have been selected under Google Summer of Code, 2016 program to work under The Julia Language on the project titled Adding support for complex-domain convex optimization problems in Convex.jl

Feb, 2016: IIT Kharagpur won the General Championship in the 4th Inter IIT Tech Meet. So proud to be the member of the contingent.

A month of Excitement!

Every day is a new experience and it opening up new opportunities. Life is awesome.

That is how I would describe my first month as GSoCer. From meeting new like minded people from all across the world to having an intense discussion about our projects in different groups to getting excited about receiving google goodies and now finally getting invited to JuliaCon'16 to be held at MIT, Cambridge, Massachusetts. Every experience has been worth the work put behind getting selected into GSoC'16.

I got into coding my project and community bonding as soon as my exams got over on 29th April. Before I get into the details of my project, I would like to extend this opportunity to explain how my project is of importance to the optimization community in particular. The reason is as follows:

Many problems in applied sciences are posed as optimization problems over the complex field such as:

  • Phase retrieval from sparse signals
  • Designing an FIR filter given desired frequency response
  • Optimization problems in AC power systems
  • Frequency domain analysis in signal processing and control theory.

The present approach is to manually convert the complex-domain problems to real-domain problems (example) and pass to solvers. This process can be time-consuming and non-intuitive sometimes. The correct approach to such problem would be to make existing packages deal with complex-domain optimization hence making it easier for the optimization community to deal with complex-domain problems.

In the first hangout call with my mentors, we had decided to implement the functionality to support the Linear complex-domain optimization problem in Convex.jl. In order to support the above functionality, I was required to make the following changes:

  1. Add support for complex variables, Hermitian SemiDefinite matrix and complex constants. This was done by introducing a new sign ComplexSign which was introduced as the subtype of the user-defined type Sign in dcp.jl file. I also went on to define the new rules for basic rules on interactions of mathematical expressions with the mathematical expression with ComplexSign

    abstract Sign
    type Positive <: Sign                   end
    type Negative <: Sign                   end
    type NoSign <: Sign                     end
    type ComplexSign <: Sign                end
    
    -(s::ComplexSign) = ComplexSign()
    +(s::ComplexSign, t::ComplexSign) = s
    +(s::Sign, t::ComplexSign) = t
    +(t::ComplexSign, s::Sign) = s+t
    *(t::ComplexSign, s::ComplexSign) = t
    *(t::ComplexSign, s::Sign) = t
    *(s::Sign, t::ComplexSign) = t
    *(s::ComplexSign, m::Monotonicity) = NoMonotonicity()
    *(m::Monotonicity, s::Sign) = s * m
    *(s::ComplexSign, v::ConstVexity) = v
    *(s::ComplexSign, v::AffineVexity) = v
    *(s::ComplexSign, v::ConvexVexity) = NotDcp()
    *(s::ComplexSign, v::ConcaveVexity) = NotDcp()
    *(s::ComplexSign, v::NotDcp) = v
    
  2. As a result of changes made in point 1, I was able to modify the constant.jl and variable.jl file accordingly.

    ComplexVariable(m::Int, n::Int, sets::Symbol...) = Variable((m,n), ComplexSign(), sets...)
    ComplexVariable(sets::Symbol...) = Variable((1, 1), ComplexSign(), sets...)
    ComplexVariable(size::Tuple{Int, Int}, sets::Symbol...) = Variable(size, ComplexSign(), sets...)
    ComplexVariable(size::Int, sets::Symbol...) = Variable((size, 1), ComplexSign(), sets...)
    HermitianSemidefinite(m::Integer) = ComplexVariable((m,m), :Semidefinite)
    function HermitianSemidefinite(m::Integer, n::Integer)
      if m==n
        return ComplexVariable((m,m), :Semidefinite)
      else
        error("HermitianSemidefinite matrices must be square")
      end
    end
    

    Now, the user can create the complex variables, Hermitian-semidefinite matrix in Convex.jl as follows:

    y = ComplexVariable()
    Variable of
    size: (1, 1)
    sign: Convex.ComplexSign()
    vexity: Convex.AffineVexity()
    z = HermitianSemidefinite(4)
    Variable of
    size: (4, 4)
    sign: Convex.ComplexSign()
    vexity: Convex.AffineVexity()
    
  3. The third step was to redefine the affine atoms in Convex.jl to accommodate the complex variable case as well. Interestingly, by virtue of the rules defined in step 1 in dcp.jl, we didn't even need to define any affine atom to accommodate the complex case, Convex.jl is smart enough to pick the sign of the expression using rules in dcp.jl.

So this was the summary of what I have been doing on the coding side of GSoC. On the other side, I moved to Bangalore (India's Silicon Valley) where I would be spending next two months and I am very excited to meet the other JuliaSoCers at the Julia Computing Bangalore office in the days to come.

Also, the coming week is crucial for us as we have decided to fully implement the complex-domain Linear Programming Problem in Convex.jl. Last night, we had our 3rd hangout meeting where we discussed the nitty-gritty of Convex.jl especially the conic form functionality and how to proceed so as to meet the deadline that we have set up for ourselves. I look forward to another week of coding and making new friends and getting involved in the Open Source Community.

The Bragging Rights- Convex.jl would be the second optimization package after Matlab's cvx to support complex-domain optimization problems, so yes this excites me as well to be a part the team that is developing such a useful package.

Also, I am very very much excited to attend JuliaCon'16 at the Massachusetts Institute of Technology and give a talk on what I have been doing this summer. I hope I get the visa on time, wish me luck everyone :)

See you next time till then keep laughing, spreading happiness and the most importantly living this journey called life!!!!