Spark vs Hadoop PageRank execution time comparison

2018. 11. 20. 19:27

Total input: 300,000 lines × 66 rows = 19.8 million rows ≈ 5 GB

Output (ranks): about 19.8 million rows ≈ 549 MB

Environment	Job	Condition 1	Condition 2	Condition 3	Elapsed time (min)	Result file count (rows)
Hadoop / PageRank	PageRank, 5 iterations	YARN, 2 GB	12 containers	map split by 67 files (80 MB each)	46.5
	setup				0
	itr1 (setup + itr1)				5.154917
	itr2				9.771217
	itr3				7.103517
	itr4				9.663967
	itr5				14.85307
Spark / PageRank	PageRank, 5 iterations	YARN, 3 GB	5 executors, 2 cores each	spark.read with 67 files (80 MB each)	13.6
	setup			includes read time and mapPartition.cache()	0.68
	itr1				2.583
	itr2				2.61
	itr3				2.48
	itr4				2.61
	itr5				2.6
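
As a quick sanity check on the table above, the per-step times can be summed and compared against the reported totals. This is only a small sketch using the figures listed in the table; the variable names are illustrative and not part of the original measurement scripts.

```python
# Per-step times in minutes, copied from the table (setup first, then itr1..itr5).
hadoop_steps = [0, 5.154917, 9.771217, 7.103517, 9.663967, 14.85307]
spark_steps = [0.68, 2.583, 2.61, 2.48, 2.61, 2.6]

hadoop_total = sum(hadoop_steps)
spark_total = sum(spark_steps)

# Ratio of total elapsed time: how many times longer the Hadoop run took.
ratio = hadoop_total / spark_total

print(f"Hadoop total: {hadoop_total:.1f} min")
print(f"Spark total:  {spark_total:.1f} min")
print(f"Hadoop / Spark ratio: {ratio:.2f}x")
```

The sums match the reported totals (46.5 and 13.6 minutes), so for this run Hadoop took roughly 3.4 times as long as Spark.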

Environment	Job	Condition 1	Condition 2	Condition 3	Elapsed time	Result file count (rows)
Hadoop / ETL, Hadoop local mode	title parsing	local mode			1033586 ms ≈ 17.2 min	17773690
	find links for (id,Title) : (id,Title)	local mode	id matching		memory exceeded	memory exceeded
	find links for (Title) : (Title)	local mode	only title		2594021 ms ≈ 43.2 min	16861907 / 4.4 GB
Hadoop / ETL, Hadoop cluster mode	title parsing	cluster mode			447268 ms ≈ 7.5 min	16861907
	find links for (id,Title) : (id,Title)	cluster mode	id matching		819340 ms ≈ 13.7 min
	find links for (Title) : (Title)	cluster mode	only title	memory = 1024, containers = 23, map/reduce vcore = 2, map/reduces = 5	13.45 min	16861907 / 4.4 GB
	find links for (Title) : (Title)	cluster mode	only title	memory = 2048, containers = 12	13.27 min	16861907 / 4.4 GB
ETL by Spark, cluster mode (1-3)	title parsing	cluster mode			7 min
	find links for (Title) : (Title)			executors = 5, executor cores = 2, memory = 3 GB	10 min 15 s


Hadoop / PageRank	PageRank, 3 iterations	cluster mode	titles		itr1: 7 min 20 s; itr2: could not be measured



Spark / PageRank	PageRank, 5 iterations	cluster mode	dataset without persist		10 min
			dataset with persist	no repartition	14 min
			dataset with persist	repartition 10	11 min
			dataset with persist	...

* During Spark tests, errors such as "unusable node" appear: when local disk usage gets high, YARN temporarily kills the node and then brings it back.

However, when this keeps repeating, the job sometimes fails to produce a proper result.

* To resolve this, an additional disk (100 GB) has been allocated to each cluster, and the temp file storage path has been changed to point at it.

* In Hadoop, the shuffled results pour into the local_temp path. If the account lacks write permission on the temp folder, the job fails with an error.

* If a Hadoop job is stopped mid-run, logs and intermediate shuffle files may be left piled up in the temp path (usually /tmp). These files need to be cleaned up periodically.

* For PageRank on Spark, we need to check how much local disk is occupied as the iterations repeat, and a way to relieve it is also needed.
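
One simple way to track this is to sample the disk usage of the local directory between iterations. The following is a minimal sketch using only the Python standard library; the path used here is a placeholder, and which directory Spark actually spills to depends on the cluster configuration.

```python
import shutil
import tempfile

def disk_usage_percent(path):
    """Return the used share (in percent) of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Placeholder: the system temp directory. Point this at the actual local
# directory used for shuffle/temp files on each node.
local_dir = tempfile.gettempdir()
print(f"{local_dir}: {disk_usage_percent(local_dir):.1f}% used")
```

Running this on each node before and after every iteration would show whether the occupied space keeps growing with the iteration count.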

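* In Hadoop, pay attention to config settings such as conf.set("mapreduce.textoutputformat.separator", "::"); (the same goes for YARN). Property names differ between Hadoop versions (Hadoop 2.x lists this key as mapreduce.output.textoutputformat.separator), so check the key against the version in use. See: Hadoop - map reduce configuration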

* Currently running applications can be checked with yarn application -list and then killed with yarn application -kill.
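
The list-then-kill step can be scripted. The sketch below only extracts application IDs from text and builds the corresponding kill commands; it does not call YARN itself. The sample text is made up for illustration and only stands in for the real output of yarn application -list.

```python
import re

# YARN application IDs have the form application_<cluster timestamp>_<sequence>.
APP_ID = re.compile(r"application_\d+_\d+")

def kill_commands(list_output):
    """Build one `yarn application -kill <id>` command per application ID found."""
    app_ids = sorted(set(APP_ID.findall(list_output)))
    return [["yarn", "application", "-kill", app_id] for app_id in app_ids]

# Made-up sample text standing in for the output of `yarn application -list`.
sample = """\
application_1542700000000_0007  pagerank-hadoop  MAPREDUCE  RUNNING
application_1542700000000_0009  pagerank-spark   SPARK      RUNNING
"""

for cmd in kill_commands(sample):
    print(" ".join(cmd))
```

On a machine with the yarn CLI installed, each command list could be passed to subprocess.run(cmd, check=True) to actually stop the application.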

Hadoop execution time (from the log)

Event	Time	Duration	Notes
app start	11:07:03
Set-up	11:07:16	0:00:13
First map start	11:07:18	0:00:15	472 tasks run concurrently
First reducer start	11:07:45	0:00:42	about 27 s for one task to get from map to reduce
map end time	11:14:19	0:07:16	time the last map completed (reduce = 32%)
reducer end time	11:14:37	0:07:34	time to finish the remaining 68% of reducing = 18 s
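
The duration column can be reproduced from the wall-clock times in the table. This is a small sketch with the timestamps copied from the log table; the event labels and the helper name are illustrative.

```python
from datetime import datetime

# (event, wall-clock time) pairs copied from the log table.
events = [
    ("app start", "11:07:03"),
    ("set-up", "11:07:16"),
    ("first map start", "11:07:18"),
    ("first reducer start", "11:07:45"),
    ("map end", "11:14:19"),
    ("reducer end", "11:14:37"),
]

def elapsed_since_start(events):
    """Return {event: timedelta elapsed since the first event}."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(events[0][1], fmt)
    return {name: datetime.strptime(t, fmt) - start for name, t in events}

durations = elapsed_since_start(events)
for name, delta in durations.items():
    print(f"{name:<20} {delta}")

# Gap between the last map finishing and the reducers finishing, in seconds.
tail = (durations["reducer end"] - durations["map end"]).total_seconds()
print(f"reduce tail after last map: {tail:.0f} s")
```

The 18 s tail after the last map is the time the reducers needed to finish the remaining 68% of the reduce phase.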
