Amazon EMR - Nutch running on EMR
Can someone please guide me in the right direction? I'm trying to get Nutch working on Amazon EMR. So far, I can get Nutch running locally, where it is launched using the shell scripts that come with it.
However, on Amazon, I need to specify the jar location and options. I can get the jar by compiling it myself, but I don't know where to start as far as the startup options are concerned.
Additionally, what is the main difference between Nutch 1.x and Nutch 2.0? Is one recommended on EMR over the other?
In case you're still looking for an answer:
When you build Nutch, you will see a job jar in the deploy directory. Upload it to S3 and reference it as your custom jar while setting up the EMR job flow.
You can then add steps and specify the main class, for example org.apache.nutch.crawl.Crawl, along with the arguments you want. This doesn't change from the way it works in local mode. For example: urls -dir mycrawl -threads 10 -depth 5 -topN 1000
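As a rough sketch of how the jar, main class and arguments fit together in one step, the Python below builds a step definition in the shape that the EMR AddJobFlowSteps API accepts. The bucket name, the jar file name and the helper build_nutch_step are hypothetical, used only for illustration; the main class and arguments are the ones from the example above.

```python
import shlex


def build_nutch_step(jar_s3_uri, main_class, arg_string, name="Nutch crawl"):
    """Return one EMR step definition (a plain dict) for a custom jar."""
    return {
        "Name": name,
        # Assumption: keep the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": jar_s3_uri,        # the job jar uploaded to S3
            "MainClass": main_class,  # class EMR should run from the jar
            # Same arguments as in local mode, split into a list.
            "Args": shlex.split(arg_string),
        },
    }


step = build_nutch_step(
    "s3://my-bucket/nutch/nutch.job",  # hypothetical S3 location
    "org.apache.nutch.crawl.Crawl",
    "urls -dir mycrawl -threads 10 -depth 5 -topN 1000",
)
print(step)
```

If you use boto3, a dict like this could be passed in the Steps list of the EMR client's add_job_flow_steps call; the same three values (jar location, main class, arguments) are what the console asks for when you add a custom jar step.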
If you intend to use something other than Crawl.java, you can find out which main class to use by looking at the bin/nutch script.
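If you'd rather not read the script by eye, a small sketch like the following can list the command-to-class mappings. It assumes bin/nutch assigns the class with lines of the form CLASS=... right after a test on "$COMMAND"; the SAMPLE text is an illustrative excerpt written for this example, not a verbatim copy, and the real script may differ between Nutch versions. The helper name command_classes is hypothetical.

```python
import re

# Illustrative excerpt only; read your own bin/nutch instead.
SAMPLE = '''\
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
else
  CLASS=$COMMAND
fi
'''

COMMAND_RE = re.compile(r'"\$COMMAND"\s*=\s*"([^"]+)"')
CLASS_RE = re.compile(r'^\s*CLASS=([\w.]+)\s*$')


def command_classes(script_text):
    """Map each command name to the Java class assigned right after it."""
    mapping = {}
    current = None
    for line in script_text.splitlines():
        cmd = COMMAND_RE.search(line)
        if cmd:
            current = cmd.group(1)
            continue
        cls = CLASS_RE.match(line)
        if cls and current is not None:
            mapping[current] = cls.group(1)
            current = None
    return mapping


classes = command_classes(SAMPLE)
for command, java_class in classes.items():
    print(command, "->", java_class)
```

To run it against a real checkout, replace SAMPLE with the contents of your own bin/nutch, for example open("bin/nutch").read().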