WITSML data processing with Kafka and Spark Streaming

Presented by Dmitry Kniazev at the Houston Hadoop Meetup, 4/28/2016

  1. WITSML data processing example with Kafka and Spark Streaming (Houston Hadoop Meetup, 4/26/2016)
  2. About me - Dmitry Kniazev
     Currently: Solution Architect at EPAM Systems
     - About 4 years in Oil & Gas here in Houston
     - Started working with Hadoop about 2 years ago
     Before that: BI/DW Specialist at EPAM Systems for 6 years
     - Reports and ETL with Oracle, Microsoft, Cognos and other tools
     - Enjoyed the not-SO-HOT life in Eastern Europe
     Before that: Performance Analyst at EPAM Systems for 4 years
     - Web application and database optimization
  3. What is the problem? (Source: http://www.croftsystems.net/blog/conventional-vs.-unconventional)
  4. What is WITSML? A data exchange standard for the upstream oil and gas industry.
     [Diagram: rig aggregation solutions at Service Company #1 and #2 feed their WITSML data stores; Operator #1 pulls from those stores into its own WITSML data store and corporate store, which serve WITSML-based applications]
  5. Operator Company Data Center - Architecture
     [Diagram: a WITSML data store in the Service Company DC is polled via WITSML over SOAP across the Internet by a Producer (Scala); messages flow through Kafka to Consumers (Scala) that write to HBase and push alerts to Email / Browser]
  6. What is Kafka?
  7. What is Spark Streaming?
  8. Discretized Stream
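     A DStream is a sequence of RDDs, one per batch interval, which is why the consumer later uses foreachRDD. A minimal, self-contained sketch of that idea using Spark's queueStream test helper (all names and sample values here are illustrative, not from the talk):

       import org.apache.spark.SparkConf
       import org.apache.spark.rdd.RDD
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import scala.collection.mutable

       val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
       val ssc = new StreamingContext(conf, Seconds(1))

       // queueStream turns a queue of RDDs into a DStream: each dequeued RDD
       // becomes one micro-batch, just like each batch of Kafka messages later
       val queue = new mutable.Queue[RDD[String]]()
       queue += ssc.sparkContext.makeRDD(Seq("100.5,42.0,7", "101.2,41.8,9"))
       val dStream = ssc.queueStream(queue)

       // transformations run batch by batch; print the first field of each row
       dStream.map(line => line.split(",")(0)).print()

       ssc.start()
       ssc.awaitTerminationOrTimeout(5000)
       ssc.stop()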
  9. Producer - prep
     // some important imports
     import com.mycompany.witsml.client.WitsmlClient // based on jwitsml 1.0
     import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
     import org.joda.time.DateTime // assumed: the DateTime.now() below is Joda-Time
     import scala.xml.{Elem, Node, XML}
     // variable initialization
     var producer: KafkaProducer[String, String] = null
     var startTimeIndex = DateTime.now()
     var topic = ""
     var pollInterval = 5
  10. Producer - Kafka properties
      bootstrap.servers = srv1:9092,srv2:9092
      key.serializer = org.apache.kafka.common.serialization.StringSerializer
      value.serializer = org.apache.kafka.common.serialization.StringSerializer
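      The slide lists the settings as a properties file; wired up in Scala, the props object handed to the KafkaProducer on the next slide would look something like this (a sketch, assuming the producer builds java.util.Properties directly):

        import java.util.Properties

        // same broker list and serializers as the properties above
        val props = new Properties()
        props.put("bootstrap.servers", "srv1:9092,srv2:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")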
  11. Producer - main function
      producer = new KafkaProducer[String, String](props)
      // each wellbore is a separate Kafka topic, partitioned by log
      topic = args(0)
      while (true) {
        val logs = WitsmlClient.getWitsmlResponse(logsQuery)
        // parse logs and send messages to Kafka
        (logs \ "log").foreach { node: Node =>
          // send all data from one log to the same partition
          val key = (node \ "@uidLog").text
          (node \ "data").foreach { data =>
            val message = new ProducerRecord(topic, null, key, data.text)
            producer.send(message)
          }
        }
      }
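      For context, the loop above expects a WITSML-style response shaped roughly like the literal below; the uidLog value is taken from the next slide, while the data rows are invented placeholders:

        import scala.xml.Elem

        // hypothetical response: one <log> per log, one <data> row per sample
        val logs: Elem =
          <logs>
            <log uidLog="5207KFSJ18">
              <data>2016-04-26T10:00:00Z,105.3,-12.7</data>
              <data>2016-04-26T10:00:05Z,106.1,-13.2</data>
            </log>
          </logs>

        // the same extraction the producer performs
        val keys = (logs \ "log").map(node => (node \ "@uidLog").text) // Seq("5207KFSJ18")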
  12. Producer - results
      "Well123" => Topic
      "5207KFSJ18" => Key (Partition)
      Content of <data> element => Message
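      Why the key determines the partition: with a non-null key and no explicit partition, Kafka's DefaultPartitioner hashes the key, so every row from one log lands in the same partition and keeps its order. A sketch of that computation (the partition count is illustrative):

        import org.apache.kafka.common.utils.Utils

        // same key (log uid) -> same partition; 4 partitions is an assumed number
        val numPartitions = 4
        val key = "5207KFSJ18"
        val partition = Utils.toPositive(Utils.murmur2(key.getBytes("UTF-8"))) % numPartitions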
  13. Consumer - prep
      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.sql.{Row, SQLContext}
      import org.apache.spark.sql.types.StructType
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.InputDStream
      import org.apache.spark.streaming.kafka.KafkaUtils

      var schema: StructType = null
      val sc = new SparkConf().setAppName("WitsmlKafkaDemo")
      val ssc = new StreamingContext(sc, Seconds(1))
      val dStream: InputDStream[(String, String)] =
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      val sqlContext = new SQLContext(ssc.sparkContext)
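      The kafkaParams and topics values above are not shown on the slide; with the Spark 1.x direct Kafka API they would look roughly like this (broker list from the producer slide, topic name from the results slide):

        // assumed inputs to createDirectStream above
        val kafkaParams = Map("metadata.broker.list" -> "srv1:9092,srv2:9092")
        val topics = Set("Well123") // one topic per wellbore, per the producer slides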
  14. Consumer - rules definition
      # fields for the Spark SQL query
      `Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`
      # WHERE clause for the SQL query
      `Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)
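      The fields and condition strings used by the main function on the next slide could be loaded from this rules file with a few lines of Scala; the file name and line layout here are assumptions:

        import scala.io.Source

        // hypothetical loader: first non-comment line -> SELECT fields,
        // second non-comment line -> WHERE clause
        val ruleLines = Source.fromFile("rules.conf").getLines()
          .map(_.trim)
          .filter(l => l.nonEmpty && !l.startsWith("#"))
          .toList
        val fields = ruleLines(0)    // `Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`
        val condition = ruleLines(1) // `Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)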
  15. Consumer - main function
      dStream.foreachRDD { batchRDD =>
        // each Kafka message value is one comma-separated row of log data;
        // wrap the split fields in a Row so createDataFrame accepts them
        val messages = batchRDD.map(_._2).map(line => Row.fromSeq(line.split(",")))
        // create a DataFrame with a custom schema
        val df = sqlContext.createDataFrame(messages, schema)
        // register a temp table and test it against the rule
        df.registerTempTable("timeLog")
        val collected = sqlContext.sql("SELECT " + fields + " FROM timeLog WHERE " + condition).collect()
        if (collected.length > 0) {
          // send an email alert
          WitsmlKafkaUtil.sendEmail(collected)
        }
      }
      ssc.start()
      ssc.awaitTermination()
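      The custom schema is declared but never built on these slides; one plausible way to construct it is a StringType column per mnemonic, with names matching the rules file (the column list below is an assumption):

        import org.apache.spark.sql.types.{StringType, StructField, StructType}

        // assumed mnemonic list; every column is a string, and Spark SQL
        // casts the strings when the rule compares them to numbers
        val mnemonics = Seq("time", "Co. Man G/L", "Gain Loss - Spare", "ACC_DRILL_STRKS")
        schema = StructType(mnemonics.map(name => StructField(name, StringType, nullable = true)))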
  16. Visualization with Highcharts
  17. Why Highcharts?
      - WebSockets support -> real-time data visualization
      - Multiple Y-axes that automatically scale -> many mnemonics on the same chart
      - Inverted X-axis -> great for depth logs
      - 3D charts that can be rotated -> trajectories
      - Area range with custom colors -> formations in the background
      - 100% client-side JavaScript -> easy to deploy
  18. Lessons learned
      - Throw away and re-design:
        - Logs should be topics and wells (wellbores) should be partitions, for scalability
        - Producers and consumers should be managed services (Flume agents?)
      - Backend: land data in HBase (and probably OpenTSDB)
      - Frontend:
        - Web app to visualize both NRT and historical data?
        - Mobile app for alerts?
      - Improve producers: speak many WITSML dialects?
      - Get ready for real time: support for the ETP standard
  19. Thank you! dmitry_kniazev@epam.com
      Links:
      http://www.energistics.org/
      http://www.highcharts.com/
      https://spark.apache.org/
      http://kafka.apache.org/
