Ingest and Indexing in CDH4 Hadoop Environment



Ingest and indexing of RSS news feeds in the Hadoop environment, using Flume, Morphlines, and Cloudera Search.

Published in: Technology, Education


  1. Ingest and Indexing of RSS News Feeds in the Hadoop Environment. Stephanie F. Guadagno, January 2014. SFG- 1/9/2014
  2. Introduction
     • The work runs on a virtual machine loaded with Cloudera’s CDH 4.3.
     • Used Flume 1.3, Cloudera Morphlines, Cloudera Search with Solr 4.3, and Hadoop 2.0.
     • Used Flume to pull RSS news feeds from BBC World News into HDFS.
     • The news data in HDFS was indexed and loaded into Solr using the MapReduceIndexerTool and Cloudera’s Morphlines framework.
  3. Overview of Components Used
     • Flume reliably ingests large amounts of data from various sources (e.g. log files, web sites, social media sites) into a centralized or distributed data store, such as HBase or HDFS.
     • MapReduceIndexerTool is a MapReduce batch job driver used with Cloudera Search. It indexes a set of input files and writes the indexes into HDFS. Its GoLive feature merges the output shards into a set of live Solr servers (e.g. a SolrCloud).
     • Cloudera Morphlines is a new open source framework that facilitates simple ETL of ingested data into Apache Solr. It consists of the Morphlines library and a specification for creating a “morphline”, which is a chain of transformation commands.
     • Cloudera Search facilitates Big Data search by bringing search and scalable indexing from Solr 4.X into the Hadoop ecosystem.
  4. Flume’s Data Flow
     • A Flume Agent is a Java process that hosts the Flume Source, Channel, and Sink components through which events flow from an external source to the next destination.
     • An event is a unit of data that flows through the components.
     • A Flume Source listens for events and writes each event to the Channel.
     • The Channel queues the events as transactions.
     • The Flume Sink writes each event to the external destination (e.g. HDFS, HBase, Solr, or a file) and removes the event from the queue.
     [Diagram: External Source (e.g. social media, log files, web pages, RSS news feed), in a format recognized by the Flume Source -> Agent: Source (e.g. Avro, Exec, HTTP, JMS, Syslog) -> Channel (e.g. Memory, File, JDBC) -> Sink (e.g. File, HDFS, HBase, Morphline Solr Sink), in a format specified by the Flume Sink -> File, HBase, HDFS, or Solr]
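The source, channel, and sink roles described on this slide can be illustrated with a toy model built only on the Java standard library. This is a sketch, not Flume's API: the class name FlumeFlowSketch, the END sentinel, and the use of a bounded BlockingQueue as the "channel" are all assumptions made for illustration, and the transactional behavior of a real Flume channel is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy model of a Flume agent (source -> channel -> sink). Hypothetical; not Flume's API. */
final class FlumeFlowSketch {

    /** Sentinel (an assumption of this sketch) telling the sink that the source is done. */
    private static final String END = "\u0000END";

    /** Pushes events through a bounded in-memory channel; returns what the sink delivered. */
    static List<String> run(List<String> events, int channelCapacity) {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(channelCapacity);

        // Source: writes each event to the channel, blocking while the channel is full.
        Thread source = new Thread(() -> {
            try {
                for (String event : events) {
                    channel.put(event);
                }
                channel.put(END);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        source.start();

        // Sink: removes each event from the channel and writes it to the destination.
        List<String> destination = new ArrayList<>();
        try {
            String event;
            while (!(event = channel.take()).equals(END)) {
                destination.add(event);
            }
            source.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the channel", ex);
        }
        return destination;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("item-1", "item-2", "item-3"), 2));
    }
}
```

The bounded queue plays the part of the channel capacity: when the sink falls behind, the source blocks instead of dropping events.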
  5. Morphline Data Flow
     • Cloudera’s Morphlines is a Java library that was developed as part of Cloudera Search.
     • The library contains a suite of frequently used transformation and I/O “command” classes for simple ETL on data flowing into Solr.
     • The library can be integrated into Flume for near-real-time (NRT) ETL or into MapReduce for batch ETL.
     • For batch ETL, Cloudera provides the MapReduceIndexerTool for data in HDFS and the HBaseMapReduceIndexerTool for data in HBase.
     • A morphline consumes input records and turns them into a stream of records. The stream is piped through a chain of transformation commands.
     [Diagram: Source (batch: HDFS, HBase; NRT: Flume Source) -> record -> Morphline (cmd -> cmd -> … -> cmd) -> record -> Solr]
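The idea of piping records through a chain of commands can be sketched in plain Java. This is a toy model under stated assumptions, not the Morphlines API: a record is modeled as a Map, a command as a UnaryOperator over that map, and the rename and keepOnly commands are hypothetical stand-ins for real transformation commands.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy model of a morphline: records piped through a chain of commands. Not the real API. */
final class MorphlineChainSketch {

    /** Pipes every record through the commands in order and collects the final records. */
    static List<Map<String, Object>> run(List<Map<String, Object>> records,
                                         List<UnaryOperator<Map<String, Object>>> commands) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> record : records) {
            Map<String, Object> current = new LinkedHashMap<>(record);
            for (UnaryOperator<Map<String, Object>> command : commands) {
                current = command.apply(current); // output of one command feeds the next
            }
            out.add(current);
        }
        return out;
    }

    /** Hypothetical command: moves a field's value to a new field name. */
    static UnaryOperator<Map<String, Object>> rename(String from, String to) {
        return record -> {
            Map<String, Object> copy = new LinkedHashMap<>(record);
            if (copy.containsKey(from)) {
                copy.put(to, copy.remove(from));
            }
            return copy;
        };
    }

    /** Hypothetical command: drops every field that is not in the allowed list. */
    static UnaryOperator<Map<String, Object>> keepOnly(String... allowed) {
        List<String> names = Arrays.asList(allowed);
        return record -> {
            Map<String, Object> copy = new LinkedHashMap<>(record);
            copy.keySet().retainAll(names);
            return copy;
        };
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("Title", "Example headline");
        record.put("unknown_field", "dropped");

        List<UnaryOperator<Map<String, Object>>> commands = new ArrayList<>();
        commands.add(rename("Title", "title"));
        commands.add(keepOnly("title"));

        System.out.println(run(Arrays.asList(record), commands));
    }
}
```

Each command returns a new map rather than mutating its input, which keeps the chain easy to reason about when commands are reordered.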
  6. News Feed ETL Data Flow
     [Diagram: (1) Ingest using Flume: External Source (BBC RSS news feeds: us, uk, asia, etc.) -> Agent “agent” (Custom Source -> Memory Channel -> HDFS Sink) -> Avro JSON record(s) in HDFS (“newsfeeds/”). (2) Index using MapReduce and Morphline: Avro JSON record(s) in HDFS -> MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool, a MapReduce job driven by the Morphline configuration file) -> Solr Cloud (“news_feeds”)]
     1) Ingest using Flume
     • Implemented a custom Flume event-driven source to pull RSS news feeds from BBC World News. Details:
       – Must implement Flume’s EventDrivenSource interface
       – Parsed the news feed items
       – Wrote each item to the Channel in Avro JSON format
     • Ensured the agent was defined. CDH4 came with an agent called “tier1”; I created an agent called “agent”.
     • Configured the data flow in a Flume agent configuration file.
     • Wrote a script that runs the Flume agent with the agent configuration file.
     2) Index using MapReduce and Morphline
     • Created a Solr instance for the “news_feeds” collection, with the schema modified for the fields in the news feed data.
     • Created the “news_feeds” collection with 1 shard.
     • Wrote the morphline file.
     • Wrote a script that runs the MapReduceIndexerTool with the morphline specification file.
  7. News Feed Ingest Details, 1 of 2 (Configuration)
     Chose:
     1. Custom Flume Event Driven Source
     2. Memory Channel
     3. HDFS Sink

         # Flume Data Flow Configuration
         # ----------------------------------------------
         # Definitions
         agent.sources=news-source
         agent.channels=memory-ch
         agent.sinks=hdfs-sink

         # Channel (memory channel with queue capacity of 5000)
         agent.channels.memory-ch.type=memory
         agent.channels.memory-ch.capacity=5000

         # Sources (ingest using RSSFlumeSourceReader class)

         # Sink (output to HDFS in Text format)
         agent.sinks.hdfs-sink.type=hdfs
         agent.sinks.hdfs-sink.hdfs.path=hdfs://localhost:8020/user/cloudera/flume/newsfeeds
         agent.sinks.hdfs-sink.hdfs.filePrefix=input
         agent.sinks.hdfs-sink.hdfs.fileType=DataStream
         agent.sinks.hdfs-sink.hdfs.writeFormat=Text
  8. News Feed Ingest Details, 2 of 2 (Custom Source: RSSFlumeSourceReader)

         import org.apache.flume.Context;
         import org.apache.flume.EventDrivenSource;
         import org.apache.flume.channel.ChannelProcessor;
         import org.apache.flume.conf.Configurable;
         import org.apache.flume.source.AbstractSource;

         public class RSSFlumeSourceReader extends AbstractSource
                 implements EventDrivenSource, Configurable {

             @Override
             public void configure(Context context) {
                 // required by Configurable; settings from the agent configuration file are read here
             }

             @Override
             public synchronized void start() {
                 super.start();
                 // Fetch the processor here rather than in a field initializer:
                 // Flume sets it after the source has been constructed.
                 ChannelProcessor cp = getChannelProcessor();
                 // for each URL {
                 //   read the RSS news feed
                 //   obtain a Document by parsing the feed with DocumentBuilder (javax.xml.parsers)
                 //   get the NodeList of "item" elements contained in the Document
                 //   for each node in the NodeList {
                 //     write the item to `out` in Avro JSON format using the Apache Avro library
                 //   }
                 //   create a Flume Event and send it to the Channel:
                 //     Event event = EventBuilder.withBody(out.toString(), Charsets.UTF_8);
                 //     cp.processEvent(event);
                 // }
             }

             @Override
             public synchronized void stop() {
                 super.stop();
             }
         }
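The parsing steps outlined in the comments of the custom source (build a Document, find the "item" elements, read their fields) might look roughly like the sketch below. It uses only the JDK's javax.xml.parsers classes and leaves out Flume and Avro entirely. The class name RssItemParserSketch and the choice of which RSS 2.0 child elements to read (title, link, pubDate, description) are assumptions; the deck does not show the actual field handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch of the RSS parsing step only; the deck's real source is not shown in full. */
final class RssItemParserSketch {

    /** RSS 2.0 child elements read from each item (an assumption of this sketch). */
    private static final String[] FIELDS = {"title", "link", "pubDate", "description"};

    /** Parses RSS XML and returns one field map per "item" element. */
    static List<Map<String, String>> parseItems(String rssXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

            NodeList items = doc.getElementsByTagName("item");
            List<Map<String, String>> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                Map<String, String> fields = new LinkedHashMap<>();
                for (String tag : FIELDS) {
                    // Look only inside this item, so the channel-level title is ignored.
                    NodeList match = item.getElementsByTagName(tag);
                    if (match.getLength() > 0) {
                        fields.put(tag, match.item(0).getTextContent().trim());
                    }
                }
                result.add(fields);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>Example</title><link>http://example.com/a</link></item>"
                + "</channel></rss>";
        System.out.println(parseItems(xml));
    }
}
```

In the custom source, each returned map would then be serialized (Avro JSON in the deck) and wrapped in a Flume event.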
  9. News Feed Data in HDFS
  10. News Feed Data Indexing Details, 1 of 3 (MapReduce)
     [Diagram: Avro JSON record(s) in HDFS (“newsfeeds/”) -> MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool, a MapReduce job driven by the Morphline configuration file) -> Solr Cloud (“news_feeds”)]
     Two tools are used:
     1. HdfsFindTool: finds the files changed within the past day.
     2. MapReduceIndexerTool: runs a MapReduce job that indexes the HDFS input files and pushes the index to Solr.

         # Go-live merges the output shards of the previous phase into a
         # set of on-line Solr servers.
         #
         echo "Running go-live mode"
         # HdfsFindTool prints the matching files; MapReduceIndexerTool reads
         # that list from stdin (--input-list -).
         hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.HdfsFindTool \
           -find hdfs:///${HDFS_INDIR} -type f -name 'in*' -mtime -1 |
         sudo -u hdfs hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
           jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
           --libjars /usr/lib/solr/contrib/mr/search-mr-0.9.1-cdh4.3.0-SNAPSHOT.jar \
           -D '' \
           --log4j ${LOG_FILE} ${DRYRUN} \
           --morphline-file ${MORPHLINE_FILE} \
           --update-conflict-resolver org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
           ${REDUCERS_ARG} --verbose \
           --output-dir hdfs://localhost:8020/${HDFS_SOLR_IDXDIR} \
           --go-live --zk-host localhost:2181/solr --collection ${COLLECTION} \
           --input-list -

         echo "Clean-up tmp directory"
         sudo -u hdfs rm /tmp/solr*.zip
         echo "Done."
  11. News Feed Data Indexing Details, 2 of 3 (Morphline)
     [Diagram: Record -> “readAvro” -> Record -> “extractAvroPaths” -> Record -> “convertTimestamp” -> Record -> “sanitizeUnknownSolrFields” -> Record -> “loadSolr” -> Document -> Solr Cloud (“news_feeds”)]

         SOLR_LOCATOR : {
           # specify collection and zkHost
         }

         morphlines : [
           {
             id : morphlineNewsFeed
             importCommands : ["com.cloudera.**"]
             commands : [
               {
                 readAvro {
                   isJson : true
                   writerSchemaFile : /home/dataingest/schema/NewsRecord.avsc
                 }
               }
               {
                 extractAvroPaths {
                   flatten : false
                   paths : {
                     id : /id
                     title : /Title
                     url : /Link
                     published_date : /Publish_Date
                     author : /Author
                     comments : /Comments
                     description : /Description
                   }
                 }
               }
               # Convert published_date to native Solr timestamp format
               {
                 convertTimestamp {
                   field : published_date
                   inputFormats : ["EEE, d MMM yyyy HH:mm:ss z", "EEE, dd MMM yyyy HH:mm:ss z"]
                   inputTimezone : GMT
                   outputTimezone : US/Eastern
                 }
               }
               # Solr will throw an exception on any attempt to load
               # a document containing a field not specified in schema.xml.
               {
                 sanitizeUnknownSolrFields {
                   # Location from which to fetch Solr schema
                   solrLocator : ${SOLR_LOCATOR}
                 }
               }
               {
                 loadSolr {
                   solrLocator : ${SOLR_LOCATOR}
                 }
               }
             ]
           }
         ]

     Tidbits
     • The file uses HOCON (Human-Optimized Config Object Notation), a JSON-like format.
     • A morphline is defined as a tree of commands.
     • The output of one command is sent to the next command.
     • The morphline is compiled at run time.
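What the convertTimestamp step does to an RSS date string can be approximated in plain Java. This is a sketch under assumptions, not the Morphlines implementation: the class name ConvertTimestampSketch is hypothetical, the two input patterns are taken from the morphline above and interpreted as java.text.SimpleDateFormat patterns, and the output is normalized to UTC (the form Solr date fields use) rather than to the US/Eastern zone named in the deck's configuration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Sketch of timestamp normalization; not the Morphlines convertTimestamp code. */
final class ConvertTimestampSketch {

    /** Input patterns copied from the morphline's inputFormats list. */
    private static final String[] INPUT_FORMATS = {
        "EEE, d MMM yyyy HH:mm:ss z",
        "EEE, dd MMM yyyy HH:mm:ss z"
    };

    /** Tries each input pattern in turn; returns the instant as an ISO-8601 UTC string. */
    static String toSolrTimestamp(String value) {
        for (String pattern : INPUT_FORMATS) {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ENGLISH);
            // Default zone, used only if the text carries no zone of its own.
            in.setTimeZone(TimeZone.getTimeZone("GMT"));
            try {
                Date parsed = in.parse(value);
                SimpleDateFormat out =
                        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH);
                out.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: emit UTC
                return out.format(parsed);
            } catch (ParseException e) {
                // This pattern did not match; try the next one.
            }
        }
        throw new IllegalArgumentException("Unparseable timestamp: " + value);
    }

    public static void main(String[] args) {
        System.out.println(toSolrTimestamp("Thu, 9 Jan 2014 15:30:00 GMT"));
    }
}
```

Trying the patterns in order mirrors the list form of inputFormats: the first pattern that parses the value wins, and a value that matches none is rejected.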
  12. News Feed Data Indexing Details, 3 of 3 (Solr Collection)
     • The “news_feeds” Solr collection presently contains 3800 documents in the index.
  13. News Feed Document in Solr
  14. Summary
     • Presented ingest of RSS news feeds using
       – Flume with a custom source, memory channel, and HDFS sink
     • Presented indexing of news feed data in HDFS into Solr using
       – Cloudera’s Morphlines library and “morphline” configuration
       – Cloudera’s MapReduceIndexerTool
       – Cloudera Search with Solr 4.X
  15. Thank You. Stephanie F. Guadagno, January 2014