amazon emr - Nutch running on EMR


Can someone please guide me in the right direction? I'm trying to get Nutch working on Amazon EMR. So far, I can get Nutch running locally, launched using the shell scripts that come with it.

However, on Amazon, I need to specify a jar location and options. I can get the jar by compiling it myself, but I don't know where to start as far as the startup options are concerned.

Additionally, what is the main difference between Nutch 1.x and Nutch 2.0? Is one recommended on EMR over the other?

In case you're still looking for an answer:

When you build Nutch, you will see a job jar in the deploy directory. Upload it to S3 and reference it as your custom jar while setting up the EMR job flow.

You can add steps and mention the main class, for example org.apache.nutch.crawl.Crawl, and the arguments you want. It doesn't change the way it works in local mode. Example: urls -dir mycrawl -threads 10 -depth 5 -topN 1000.

If you intend to use something other than Crawl.java, you can find out which main class to use by looking at the bin/nutch script.
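To make the jar, main class, and arguments concrete, here is a minimal Python sketch that assembles them into a custom JAR step definition. The dictionary shape follows what boto3's EMR client accepts in `add_job_flow_steps`; the bucket name, jar file name, and step name are hypothetical placeholders, and the argument string is the one from the example above.

```python
import json


def build_nutch_step(jar_s3_uri, main_class, args, name="Nutch crawl"):
    """Return a custom JAR step definition for an EMR job flow."""
    return {
        "Name": name,                      # hypothetical step name
        "ActionOnFailure": "CONTINUE",     # keep the cluster up if the step fails
        "HadoopJarStep": {
            "Jar": jar_s3_uri,             # the Nutch job jar uploaded to S3
            "MainClass": main_class,       # class taken from bin/nutch
            "Args": list(args),            # same arguments as in local mode
        },
    }


# Same arguments you would pass locally, split into a list.
crawl_args = "urls -dir mycrawl -threads 10 -depth 5 -topN 1000".split()

step = build_nutch_step(
    "s3://my-bucket/nutch/apache-nutch.job",  # hypothetical S3 location
    "org.apache.nutch.crawl.Crawl",
    crawl_args,
)

# With boto3 installed and a running cluster, this dict could be passed as
# boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step]).
print(json.dumps(step, indent=2))
```

The same three pieces (jar location, main class, arguments) are what the EMR console asks for when you add a custom JAR step by hand.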
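As a rough illustration of reading bin/nutch for main classes, the sketch below pulls command-to-class pairs out of the script text with a regular expression. The `SAMPLE` excerpt is an abbreviated stand-in modeled on how the Nutch 1.x script maps commands to classes, not a verbatim copy, and the helper name `command_classes` is made up for this example; your copy of the script may be laid out differently.

```python
import re

# Abbreviated, illustrative excerpt in the style of bin/nutch (assumption).
SAMPLE = r'''
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
fi
'''

# Match: "$COMMAND" = "<name>" ] ; then   followed by   CLASS=<fully.qualified.Name>
PATTERN = re.compile(r'"\$COMMAND"\s*=\s*"([^"]+)"\s*\]\s*;\s*then\s+CLASS=(\S+)')


def command_classes(script_text):
    """Map each bin/nutch command name to the Java class it launches."""
    return dict(PATTERN.findall(script_text))


for command, cls in command_classes(SAMPLE).items():
    print(f"{command}: {cls}")
```

To run it against a real checkout, read the script with `open("bin/nutch").read()` and pass the text to `command_classes`.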

