# Day 5: Computing the Correlation

JavaNewbie + 1 comment Nice problem. After a few misfires trying to keep track of WAY too many variables, it dawned on me to create a class to keep track of the sums: a class Rxy containing sumX, sumX2, sumY, sumY2, sumXY, and N, plus two methods, one to add a point and another to calculate rxy.

I wonder how many rows there are in the test cases.
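The running-sums class described above could look roughly like the following in Python. This is only a sketch of the idea, not JavaNewbie's actual code: the field names follow the comment, while the method names `add` and `r` and the sample points are assumptions made for illustration.

```python
import math


class Rxy:
    """Running sums for Pearson's r between two variables (illustrative sketch)."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0
        self.sum_x2 = 0
        self.sum_y = 0
        self.sum_y2 = 0
        self.sum_xy = 0

    def add(self, x, y):
        # Fold one (x, y) point into the running sums.
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x
        self.sum_y += y
        self.sum_y2 += y * y
        self.sum_xy += x * y

    def r(self):
        # r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
        num = self.n * self.sum_xy - self.sum_x * self.sum_y
        den_x = math.sqrt(self.n * self.sum_x2 - self.sum_x ** 2)
        den_y = math.sqrt(self.n * self.sum_y2 - self.sum_y ** 2)
        return num / (den_x * den_y)


# Made-up points, not taken from the challenge's test data.
acc = Rxy()
for x, y in [(1, 2), (2, 4), (3, 7)]:
    acc.add(x, y)
print(round(acc.r(), 2))
```

Because only six numbers are kept per pair of subjects, the number of rows in the input does not affect memory use.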

ayushr2 + 0 comments why do you have to keep track of so many variables? there are two formula given. use the simpler one. like this:

```python
import math

def std(l, avg, n):
    ans = 0
    for i in l:
        ans += (avg - i) ** 2
    return math.sqrt(ans / (n - 1))

def cof(u, v, avg_u, avg_v, std_u, std_v, n):
    ans = 0
    for i in range(n):
        ans += u[i] * v[i]
    ans -= (n * avg_u * avg_v)
    return ans / ((n - 1) * std_u * std_v)

n = int(input())
m = []
p = []
c = []
sum_m = 0
sum_c = 0
sum_p = 0
for _ in range(n):
    m_, p_, c_ = map(int, input().split('\t'))
    m.append(m_)
    p.append(p_)
    c.append(c_)
    sum_m += m_
    sum_p += p_
    sum_c += c_
avg_m = sum_m / n
avg_c = sum_c / n
avg_p = sum_p / n
std_m = std(m, avg_m, n)
std_p = std(p, avg_p, n)
std_c = std(c, avg_c, n)
print(round(cof(m, p, avg_m, avg_p, std_m, std_p, n), 2))
print(round(cof(c, p, avg_c, avg_p, std_c, std_p, n), 2))
print(round(cof(m, c, avg_m, avg_c, std_m, std_c, n), 2))
```

works perfectly

nishant_275 + 0 comments For help, you can follow up the link:

jitendarbafna + 0 comments Why am I getting the second output wrong?

```python
n = int(input())
#m = p = c = []
m = []
p = []
c = []
for i in range(n):
    l = [int(x) for x in input().split()]
    m.append(l[0])
    p.append(l[1])
    c.append(l[2])
mm = sum(x for x in m)/n
mp = sum(x for x in p)/n
mc = sum(x for x in c)/n
sdm = int(sum((x-mm)**2 for x in m)/(n-1))
sdm = sdm**(1/2)
sdp = int(sum((x-mp)**2 for x in p)/(n-1))
sdp = sdp**(1/2)
sdc = int(sum((x-mc)**2 for x in c)/(n-1))
sdc = sdc**(1/2)
_mp = sum(x*y for (x,y) in zip(m,p))
_pc = sum(x*y for (x,y) in zip(p,c))
_cm = sum(x*y for (x,y) in zip(m,c))
cor_mp = (_mp-n*mm*mp)/((n-1)*sdm*sdp)
cor_pc = (_pc-n*mc*mp)/((n-1)*sdc*sdp)
cor_cm = (_cm-n*mm*mc)/((n-1)*sdm*sdc)
print(round(cor_mp,2),round(cor_pc,2),round(cor_cm,2),sep = '\n')
```
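One thing that stands out in the code above is the `int(...)` wrapped around each variance. It drops the fractional part before the square root is taken, so the standard deviations come out slightly too small, which can be enough to shift a rounded correlation by 0.01. Whether that is the only problem cannot be told from the post alone. The sketch below uses made-up scores and a hypothetical helper, `sample_std`, to show the size of the effect.

```python
import math


def sample_std(values, truncate=False):
    """Sample standard deviation; optionally truncate the variance first."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if truncate:
        variance = int(variance)  # mimics the int(...) call in the post above
    return math.sqrt(variance)


scores = [1, 2, 4]  # hypothetical data, not from the challenge
exact = sample_std(scores)                      # sqrt(7/3)
truncated = sample_std(scores, truncate=True)   # sqrt(2)
print(exact, truncated)
```

Removing the `int(...)` calls so the variance stays a float is the obvious thing to try first.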

nishkalavallabhi + 1 comment My solution in Python3. It was accepted, but I don't understand why putting math = phys = chem = [] instead of three lines gives a wrong answer.

```python
#Denominator
def getden(siga,siga2,n):
    return ((n*siga2)-(siga)**2)**0.5

#Prod Corr calculation
def prod_corr(x,y,n):
    sigx = sum(x)
    sigy = sum(y)
    sigxy = sum(x[i]*y[i] for i in range(0,n))
    sigx2 = sum([i*i for i in x])
    sigy2 = sum([i*i for i in y])
    return float((n*sigxy)-(sigx*sigy))/(getden(sigx,sigx2,n)*getden(sigy,sigy2,n))

#Read input
n = int(input())
math = []
phys = []
chem = []
for i in range(0,n):
    temp = [int(j) for j in input().split("\t")]
    math.append(temp[0])
    phys.append(temp[1])
    chem.append(temp[2])

print("{:.2f}".format(prod_corr(math,phys,n)))
print("{:.2f}".format(prod_corr(phys,chem,n)))
print("{:.2f}".format(prod_corr(math,chem,n)))
```

hashmishariul + 0 comments Because with math=phys=chem=[] there is only one list; the other two names are just aliases for it, so every append lands in the same list.
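The aliasing can be seen directly in a few lines. The variable names and the score 73 below are made up for illustration.

```python
# Chained assignment binds all three names to the same list object.
math_scores = phys_scores = chem_scores = []
math_scores.append(73)
print(phys_scores)                  # [73]
print(math_scores is chem_scores)   # True

# Separate list literals create three independent lists.
a, b, c = [], [], []
a.append(73)
print(b)                            # []
print(a is b)                       # False
```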

Jesus_Rangel + 0 comments The trick is to calculate the covariances (Cij, Cii and Cjj)

```python
import math

mth = []
phys = []
chem = []
n = int(raw_input())
for _ in range(n):
    m, p, c = map(int, raw_input().split('\t'))
    mth.append(m)
    phys.append(p)
    chem.append(c)

def corr(x, y):
    covij = 0
    covii = 0
    covjj = 0
    avg_x = (sum(x) * 1.00) / n
    avg_y = (sum(y) * 1.00) / n
    for i in range(n):
        covij += (x[i] * y[i] - avg_x * avg_y)
        covii += (x[i] ** 2 - avg_x ** 2)
        covjj += (y[i] ** 2 - avg_y ** 2)
    return (covij / ((math.sqrt(covii)) * (math.sqrt(covjj))))

print round(corr(mth, phys), 2)
print round(corr(phys, chem), 2)
print round(corr(mth, chem), 2)
```
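It may not be obvious why summing `x[i] * y[i] - avg_x * avg_y` in `corr` gives a covariance-like quantity. A short derivation, writing $\bar{x}$ and $\bar{y}$ for `avg_x` and `avg_y` (notation introduced here, not in the post), shows that it equals the usual sum of products of deviations:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  &= \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + n\,\bar{x}\,\bar{y} \\
  &= \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}
   = \sum_{i=1}^{n} \left( x_i y_i - \bar{x}\,\bar{y} \right) = C_{ij},
\end{align*}
\qquad
r = \frac{C_{ij}}{\sqrt{C_{ii}}\,\sqrt{C_{jj}}}
```

The second line uses $\sum_i x_i = n\bar{x}$ and $\sum_i y_i = n\bar{y}$. Setting $y = x$ gives $C_{ii}$, and likewise for $C_{jj}$. No division by $n - 1$ is needed, because that factor would cancel between the numerator and the denominator of $r$.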

dhawalkapil + 1 comment I am getting a timeout error for the second test case, although the same data runs fine and produces output quickly in R on my local machine.

I am using a custom correlation function:

```r
cor_calc<-function(x,y,n)
{
  xy=x*y
  x2=x^2
  y2=y^2
  num=n*as.numeric(sum(xy))-as.numeric(as.numeric(sum(x))*as.numeric(sum(y)))
  den=sqrt(n*sum(x2)-sum(x)^2)*sqrt(n*sum(y2)-sum(y)^2)
  return (num/den)
}
```

Any help??

dhawalkapil + 0 comments Solved it myself. File reading was causing timeout.

```r
f <- file("stdin")
on.exit(close(f))

#Reading Raw Data
text_data<-scan(f)

#First Row Contains N
N <- as.numeric(text_data[1])

#Rest of the rows contains data
raw_data<-text_data[2:length(text_data)]
raw_data<-matrix(raw_data,ncol=3,nrow=N,byrow=TRUE)
raw_data<-data.frame(raw_data)
```
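For readers following along in Python, the same read-everything-at-once idea can be sketched as below. The helper name `parse_scores` and the sample text are assumptions for illustration. On the judge, the string would presumably come from `sys.stdin.read()` instead.

```python
def parse_scores(text):
    """Parse 'N' followed by N rows of three whitespace-separated integers."""
    tokens = text.split()          # splits on tabs, spaces and newlines alike
    n = int(tokens[0])             # first token is the row count
    values = [int(t) for t in tokens[1:1 + 3 * n]]
    rows = [tuple(values[i:i + 3]) for i in range(0, 3 * n, 3)]
    return n, rows


# Made-up input; on the judge this would be sys.stdin.read().
sample = "2\n73\t72\t76\n48\t67\t76\n"
n, rows = parse_scores(sample)
print(n, rows)
```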

polettix + 0 comments Although I usually use Perl, this was really something for R. Just wrote the most basic and straightforward solution and it worked immediately...

jonathandygert + 0 comments I can't seem to get a very efficient solution in Haskell. Is this challenge just inherently expensive? I managed to get 4.99s after ensuring that nothing was calculated twice and minimizing type conversion.

georgesung + 1 comment FYI in python, if you import numpy, there is a built-in library function to solve this, it is numpy.corrcoef: http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html#numpy.corrcoef

jalFaizy + 1 comment But numpy is not supported for this problem, right? It's explicitly written: "There is no special library support available for this challenge."

georgesung + 1 comment numpy is supported for machine learning challenges such as this one, see https://www.hackerrank.com/environment

rashmib + 0 comments numpy doesn't work for this problem (I tried it)

devbabu + 0 comments Hello, I am getting a wrong answer for the sample case, but it's working for TC 1 (the additional case). Can anyone please look into my submission? https://www.hackerrank.com/challenges/computing-the-correlation/submissions/code/12919393
