  1. Jean Georges Perrin, @jgperrin, Oplo
     Extending Spark's Ingestion: Build Your Own Java Data Source
     #EUdev6
  2. JGP (@jgperrin)
     Jean Georges Perrin, Chapel Hill, NC, USA
     I 🏗 SW
     #Knowledge = 𝑓(∑(#SmallData, #BigData), #DataScience) & #Software
     #IBMChampion x9, #KeepLearning @ http://jgp.net
  3. A Little Help
     Hello Ruby, Nathaniel, Jack, and PN!
  4. And What About You?
     • Who is a Java programmer?
     • Who swears by Scala?
     • Who has tried to build a custom data source?
     • In Java?
     • Who has succeeded?
  5. My Problem…
     CSV, JSON, RDBMS, REST, Other
  6. Solution #1
     • Look in the Spark Packages library
     • https://spark-packages.org/
     • 48 data sources listed (some dups)
     • Some with source code
  7. Solution #2
     • Write it yourself
     • In Java

     package net.jgp.labs.spark.datasources.l100_photo_datasource;

     import org.apache.spark.sql.Dataset;
     import org.apache.spark.sql.Row;
     import org.apache.spark.sql.SparkSession;

     public class PhotoMetadataIngestionApp {
       public static void main(String[] args) {
         PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp();
         app.start();
       }

       private boolean start() {
         SparkSession spark = SparkSession.builder()
             .appName("EXIF to Dataset")
             .master("local[*]")
             .getOrCreate();
         String importDirectory = "/Users/jgp/Pictures/All Photos/2010-2019/2016";
         ...
  8. What is ingestion anyway?
     • Loading
       – CSV, JSON, RDBMS…
       – Basically any data
     • Getting a DataFrame (Dataset<Row>) with your data
  9. Our lab
     Import some of the metadata of our photos and start analysis…
 10. Code
     Dataset<Row> df = spark.read()         // Spark session
         .format("exif")                    // short name for your data source
         .option("recursive", "true")       // options
         .option("limit", "80000")
         .option("extensions", "jpg,jpeg")
         .load(importDirectory);            // where to start
     /jgperrin/net.jgp.labs.spark.datasources
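    Spark hands these options to the data source as plain strings; interpreting recursive, limit, and extensions is entirely the plug-in's job. A minimal, hypothetical sketch of that option logic, operating on relative path strings rather than real files (the class and method names are illustrative, not from the repo):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of how a photo lister might interpret the
    // "recursive", "extensions", and "limit" options shown above.
    public class PhotoPathFilter {

      // paths use '/' separators; a path containing '/' is in a subdirectory
      public static List<String> filter(List<String> paths, boolean recursive,
          List<String> extensions, int limit) {
        List<String> result = new ArrayList<>();
        for (String path : paths) {
          if (result.size() >= limit) {
            break;                          // "limit" caps the number of files
          }
          if (!recursive && path.contains("/")) {
            continue;                       // "recursive" controls subdirectories
          }
          int dot = path.lastIndexOf('.');
          String ext = dot < 0 ? "" : path.substring(dot + 1).toLowerCase();
          if (extensions.contains(ext)) {   // "extensions" filters by file type
            result.add(path);
          }
        }
        return result;
      }
    }
    ```

    A real implementation would walk the file system instead of a list, but the option semantics stay the same.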
 11. Short name
     • Optional: you can specify a full class name instead
     • An alias to a class name
     • Needs to be registered
     • Not recommended during the development phase
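    Conceptually, Spark resolves a short name by checking registered aliases first and only then treating the format string as a fully qualified class name (the actual registration goes through Java's service-loader mechanism and Spark's DataSourceRegister interface). A stdlib-only sketch of that lookup order, with illustrative names:

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Conceptual sketch of short-name resolution: an alias like "exif"
    // maps to a fully qualified provider class, which is then loaded by
    // reflection. Not Spark's actual code, just the lookup order it uses.
    public class ShortNameRegistry {
      private final Map<String, String> aliases = new HashMap<>();

      public void register(String shortName, String className) {
        aliases.put(shortName, className);
      }

      // Resolve a format string: alias first, then treat it as a class name.
      public Class<?> resolve(String format) {
        String className = aliases.getOrDefault(format, format);
        try {
          return Class.forName(className);
        } catch (ClassNotFoundException e) {
          throw new IllegalArgumentException("Unknown data source: " + format, e);
        }
      }
    }
    ```

    This is why a full class name always works during development, while the short alias only works once registration is in place.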
 12. Project
     • Information for registration of the data source
     • Data source's short name
     • Data source
     • The app
     • Relation
     • Utilities
     • An existing library you can embed/link to
     /jgperrin/net.jgp.labs.spark.datasources
 13. Library code
     • A library you already use, developed, or acquired
     • No need for anything special
 14. Data Source Code

     // Provides a relation; needed by Spark
     public class ExifDirectoryDataSource implements RelationProvider {
       @Override
       public BaseRelation createRelation(
           SQLContext sqlContext,
           Map<String, String> params) {
         // Scala-to-Java conversion (sorry!): all the options
         java.util.Map<String, String> javaMap =
             mapAsJavaMapConverter(params).asJava();
         ExifDirectoryRelation br = new ExifDirectoryRelation();
         br.setSqlContext(sqlContext);   // needs to be passed to the relation
         ...
         br.setPhotoLister(photoLister);
         return br;                      // the relation to be exploited
       }
     }
     /jgperrin/net.jgp.labs.spark.datasources
 15. Relation
     • Plumbing between
       – Spark
       – Your existing library
     • Mission
       – Returns the schema as a StructType
       – Returns the data as an RDD<Row>
 16. Relation
     TableScan is the key; other, more specialized Scans are available.

     public class ExifDirectoryRelation extends BaseRelation
         implements Serializable, TableScan {
       private static final long serialVersionUID = 4598175080399877334L;

       @Override
       public RDD<Row> buildScan() {    // the data…
         ...
         return rowRDD.rdd();
       }

       @Override
       public StructType schema() {     // the schema: may/will be called first
         ...
         return schema.getSparkSchema();
       }

       @Override
       public SQLContext sqlContext() { // SQL context
         return this.sqlContext;
       }
       ...
     }
     /jgperrin/net.jgp.labs.spark.datasources
 17. Relation – Schema
     A utility function introspects a Java bean and turns it into a "super"
     schema, which contains the required StructType for Spark.

     @Override
     public StructType schema() {
       if (schema == null) {
         schema = SparkBeanUtils.getSchemaFromBean(PhotoMetadata.class);
       }
       return schema.getSparkSchema();
     }
     /jgperrin/net.jgp.labs.spark.datasources
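    A bean-to-schema utility of this kind can be built on java.beans.Introspector. This is a hedged sketch, not the actual SparkBeanUtils code: it collects one column name and Java type per readable property, where the real utility would build a StructType. PhotoBean is an illustrative stand-in for PhotoMetadata.

    ```java
    import java.beans.BeanInfo;
    import java.beans.IntrospectionException;
    import java.beans.Introspector;
    import java.beans.PropertyDescriptor;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch of what a getSchemaFromBean()-style utility does:
    // introspect a bean's getters and derive a column per property.
    public class BeanSchemaSketch {

      public static class PhotoBean {  // stand-in for PhotoMetadata
        private String name;
        private long size;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getSize() { return size; }
        public void setSize(long size) { this.size = size; }
      }

      public static Map<String, String> columnsOf(Class<?> beanClass) {
        Map<String, String> columns = new TreeMap<>();
        try {
          // Object.class stops introspection before getClass() et al.
          BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
          for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) { // one column per readable property
              columns.put(pd.getName(), pd.getPropertyType().getSimpleName());
            }
          }
        } catch (IntrospectionException e) {
          throw new RuntimeException(e);
        }
        return columns;
      }
    }
    ```

    Mapping each Java type onto the corresponding Spark DataType is the remaining step the real utility performs.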
 18. Relation – Data
     Collect the data.

     @Override
     public RDD<Row> buildScan() {
       schema();
       // Scan the files, extract EXIF information: the interface to your library
       List<PhotoMetadata> table = collectData();
       JavaSparkContext sparkContext =
           new JavaSparkContext(sqlContext.sparkContext());
       // Create the RDD by parallelizing the list of photos
       JavaRDD<Row> rowRDD = sparkContext.parallelize(table)
           .map(photo -> SparkBeanUtils.getRowFromBean(schema, photo));
       return rowRDD.rdd();
     }

     private List<PhotoMetadata> collectData() {
       List<File> photosToProcess = this.photoLister.getFiles();
       List<PhotoMetadata> list = new ArrayList<PhotoMetadata>();
       PhotoMetadata photo;
       for (File photoToProcess : photosToProcess) {
         photo = ExifUtils.processFromFilename(photoToProcess.getAbsolutePath());
         list.add(photo);
       }
       return list;
     }
     /jgperrin/net.jgp.labs.spark.datasources
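    The getRowFromBean() side can be sketched the same way: invoke each readable getter and collect the values as one row. This is an illustrative stand-in, not the repo's implementation; it returns a map keyed by property name where the real utility emits a Spark Row in schema order.

    ```java
    import java.beans.BeanInfo;
    import java.beans.Introspector;
    import java.beans.PropertyDescriptor;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch of a getRowFromBean()-style conversion: one value per
    // readable bean property, read through its getter by reflection.
    public class BeanRowSketch {

      public static class Photo {  // stand-in bean with fixed sample values
        public String getName() { return "IMG_0176.JPG"; }
        public long getSize() { return 1206265L; }
      }

      public static Map<String, Object> rowOf(Object bean) {
        Map<String, Object> row = new TreeMap<>();
        try {
          BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
          for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) {
              row.put(pd.getName(), pd.getReadMethod().invoke(bean));
            }
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
        return row;
      }
    }
    ```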
 19. Application

     // Normal imports, no reference to our data source
     import org.apache.spark.sql.Dataset;
     import org.apache.spark.sql.Row;
     import org.apache.spark.sql.SparkSession;

     public class PhotoMetadataIngestionApp {
       public static void main(String[] args) {
         PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp();
         app.start();
       }

       private boolean start() {
         SparkSession spark = SparkSession.builder()
             .appName("EXIF to Dataset")
             .master("local[*]")    // local mode
             .getOrCreate();

         // Classic read
         Dataset<Row> df = spark.read()
             .format("exif")
             .option("recursive", "true")
             .option("limit", "80000")
             .option("extensions", "jpg,jpeg")
             .load("/Users/jgp/Pictures");

         // Standard dataframe API: getting my "highest" photos!
         df = df
             .filter(df.col("GeoX").isNotNull())
             .filter(df.col("GeoZ").notEqual("NaN"))
             .orderBy(df.col("GeoZ").desc());

         // Standard output mechanism
         System.out.println("I have imported " + df.count() + " photos.");
         df.printSchema();
         df.show(5);
         return true;
       }
     }
     /jgperrin/net.jgp.labs.spark.datasources
 20. Output
     (The full output also includes the Extension, MimeType, Directory,
     FileCreationDate, FileLastAccessDate, FileLastModifiedDate, Filename,
     Height, and Width columns.)

     +--------------------+-------+-----------+-------------------+---------+---------+
     |                Name|   Size|       GeoY|               Date|     GeoX|     GeoZ|
     +--------------------+-------+-----------+-------------------+---------+---------+
     |        IMG_0177.JPG|1129296| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743|
     |        IMG_0176.JPG|1206265| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743|
     |        IMG_1202.JPG|1799552| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197|
     |        IMG_1203.JPG|2412908|   -98.4719|2015-05-05 16:05:25|29.526403| 9128.197|
     |        IMG_1204.JPG|2261133|  -98.47248|2015-05-05 16:13:25|29.526756| 9128.197|
     +--------------------+-------+-----------+-------------------+---------+---------+
     only showing top 5 rows
 21. IMG_0176.JPG
     Rijsttafel in Rotterdam, NL
     2.6m (9ft) above sea level
 22. IMG_0176.JPG
     Hoover Dam & Lake Mead, AZ
     9314.7m (30560ft) above sea level
 23. A Little Extra
     • Data
     • Metadata (Schema)
 24. Schema, Beans, and Annotations
     • SparkBeanUtils.getSchemaFromBean() builds a Schema from a Bean
     • Schema.getSparkSchema() returns the StructType
     • SparkBeanUtils.getRowFromBean() builds a Row of Columns from a Bean
     • Can be augmented via the @SparkColumn annotation
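    The @SparkColumn idea can be illustrated with a small runtime annotation that overrides the column name derived from a getter. This is a hypothetical re-creation; the actual annotation in the repo may carry different attributes.

    ```java
    import java.beans.BeanInfo;
    import java.beans.Introspector;
    import java.beans.PropertyDescriptor;
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.lang.reflect.Method;

    // Hypothetical @SparkColumn-style annotation sketch.
    public class ColumnAnnotationSketch {

      @Retention(RetentionPolicy.RUNTIME)
      @Target(ElementType.METHOD)
      public @interface SparkColumn {
        String name() default "";
      }

      public static class Photo {  // illustrative bean
        @SparkColumn(name = "GeoX")
        public double getLatitude() { return 36.38115; }
        public String getName() { return "IMG_0177.JPG"; }
      }

      // Column name: the annotation wins, otherwise the bean property name.
      public static String columnNameFor(Class<?> beanClass, String property) {
        try {
          BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
          for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getName().equals(property)) {
              Method getter = pd.getReadMethod();
              SparkColumn col = getter.getAnnotation(SparkColumn.class);
              if (col != null && !col.name().isEmpty()) {
                return col.name();
              }
              return pd.getName();
            }
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
        return null;
      }
    }
    ```

    The same pattern extends naturally to other per-column metadata, such as nullability or an explicit Spark type.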
 25. Mass production
     • Easily import from any Java Bean
     • Conversion done by utility functions
     • Schema is a superset of StructType
 26. Conclusion
     • Reuse the Bean-to-schema and Bean-to-data utilities in your project
     • Building a custom data source for a REST server or a non-standard format is easy
     • No need for a pricey conversion to CSV or JSON
     • There is always a solution in Java
     • On you: check for parallelism, optimization, extending the schema (order of columns)
 27. Going Further
     • Check out the code (fork/like): https://github.com/jgperrin/net.jgp.labs.spark.datasources
     • Follow me @jgperrin
     • Watch for my Java + Spark book, coming out soon!
     • (If you come to the RTP area in NC, USA, come to a Spark meet-up and let's have a drink!)
 28. Go raibh maith agaibh. (Irish: "Thank you.")
     Jean Georges "JGP" Perrin
     @jgperrin
     Don't forget to rate
 29. Backup Slides & Other Resources
 30. Discover CLEGO @ http://bit.ly/spark-clego
 31. More Reading & Resources
     • My blog: http://jgp.net
     • This code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark.datasources
     • Java code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark
 32. Photo & Other Credits
     • JGP @ Oplo, Problem, Solution 1 & 2, Relation, Beans: Pexels
     • Lab, Hoover Dam, Clego: © Jean Georges Perrin
     • Rijsttafel: © Nathaniel Perrin
 33. Abstract
     EXTENDING APACHE SPARK'S INGESTION: BUILDING YOUR OWN JAVA DATA SOURCE
     By Jean Georges Perrin (@jgperrin, Oplo)

     Apache Spark is a wonderful platform for running your analytics jobs. It has
     great ingestion features for CSV, Hive, JDBC, etc. However, you may have your
     own data sources or formats you want to use. One solution is to convert your
     data to a CSV or JSON file and then ask Spark to ingest it through its
     built-in tools. For better performance, though, we will explore how to build
     a data source, in Java, that extends Spark's ingestion capabilities. We will
     first understand how Spark handles ingestion, then walk through the
     development of this data source plug-in.

     Targeted audience: Software and data engineers who need to expand Spark's
     ingestion capability.
     Key takeaways:
     • Requirements, needs & architecture – 15%
     • Build the required tool set in Java – 85%
     Session hashtag: #EUdev6