Loading + Saving Data
Loading + Saving Data
1. So far we have
a. Either converted in-memory data
b. Or used files on HDFS
2. Spark supports a wide variety of data sources
3. Data can be accessed through InputFormat & OutputFormat
a. The interfaces used by Hadoop
b. Available for many common file formats and storage
systems (e.g., S3, HDFS, Cassandra, HBase, etc.).
Loading & Saving Data
Loading + Saving Data
Common Data Sources
File formats Stores
● Text, JSON, SequenceFiles,
Protocol buffers.
● We can also configure
compression
Filesystems
● Local, NFS, HDFS, Amazon S3
Databases and key/value stores
● For Cassandra, HBase, Elasticsearch,
and JDBC databases.
Loading + Saving Data
Structured data sources through Spark SQL (aka DataFrames)
+ Efficient API for structured data sources, including JSON and Apache Hive
+ Covered later
Loading & Saving Data
Loading + Saving Data
Common supported file formats
+ A file could be in any format
+ If we know the format upfront, we can read and load it directly
+ Otherwise, a tool like the Unix “file” command can help identify it
Loading + Saving Data
Common supported file formats
+ Very common. Plain old text files. Printable chars
+ Records are assumed to be one per line.
+ Unstructured Data
Text files
Loading + Saving Data
Common supported file formats
+ JavaScript Object Notation
+ Common text-based format
+ Semistructured; most libraries require one record per line (see the parsing sketch below)
JSON files
Example:
{
  "name": "John",
  "age": 31,
  "knows": ["C", "C++"]
}
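A minimal parsing sketch (the path is hypothetical; it assumes one JSON record per line and the Jackson library that ships on Spark's classpath):

import com.fasterxml.jackson.databind.ObjectMapper

val lines = sc.textFile("/data/spark/people.json")  // hypothetical path
val people = lines.mapPartitions { iter =>
  val mapper = new ObjectMapper()                   // create one mapper per partition
  iter.map { line =>
    val node = mapper.readTree(line)                // parse one record
    (node.get("name").asText, node.get("age").asInt)
  }
}
people.take(2).foreach(println)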
Loading + Saving Data
+ Very common text-based format
+ Often used with spreadsheet applications.
+ Comma-Separated Values
Common supported file formats
CSV files
Loading + Saving Data
+ Compact Hadoop file format used for key/value data.
+ Keys and values can be binary data
+ Often used to bundle together many small files
Common supported file formats
Sequence files
See More at https://wiki.apache.org/hadoop/SequenceFile
Loading + Saving Data
+ A fast, space-efficient multilanguage format.
+ More compact than JSON.
Common supported file formats
Protocol buffers
See More at https://developers.google.com/protocol-buffers/
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
Loading + Saving Data
+ For data from one Spark job to be consumed by another
+ Uses Java serialization, so it breaks if you change your classes
Common supported file formats
Object Files
Loading + Saving Data
Handling Text Files - scala
Loading Files
var input = sc.textFile("/data/ml-100k/u1.test")
Loading Directories
var input = sc.wholeTextFiles("/data/ml-100k");
var lengths = input.mapValues(x => x.length);
lengths.collect();
[(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/mku.sh', 643),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.data', 1979173),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.genre', 202),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.info', 36) …]
Saving Files
lengths.saveAsTextFile(outputDir)
Loading + Saving Data
1. Records are stored one per line,
2. Fixed number of fields per line
3. Fields are separated by a comma (tab in TSV)
4. Row numbers can be used to detect the header row, etc.
Comma / Tab -Separated Values (CSV / TSV)
Loading + Saving Data
Data: /data/spark/temps.csv
Loading CSV - Sample Data
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, SEATLE, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-5-05
Loading + Saving Data
Loading CSV - Simple Approach
var lines = sc.textFile("/data/spark/temps.csv");
var recordsRDD = lines.map(line => line.split(","));
recordsRDD.take(10);
Array(
Array(20, " NYC", " 2014-01-01"),
Array(20, " NYC", " 2015-01-01"),
Array(21, " NYC", " 2014-01-02"),
Array(23, " BLR", " 2012-01-01"),
Array(25, " SEATLE", " 2016-01-01"),
Array(21, " CHICAGO", " 2013-01-05"),
Array(24, " NYC", " 2016-5-05")
)
Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var a = sc.textFile("/data/spark/temps.csv");
var p = a.map(
line => {
val parser = new CSVParser(',')
parser.parseLine(line)
})
p.take(1)
//Array(Array(20, " NYC", " 2014-01-01"))
spark-shell --packages net.sf.opencsv:opencsv:2.3
Or
Add this to sbt: libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
Loading CSV - Example
https://gist.github.com/girisandeep/b721cf93981c338665c328441d419253
Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var linesRdd = sc.textFile("/data/spark/temps.csv");
def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = {
val parser = new CSVParser(',')
for(line <- itr)
yield parser.parseLine(line)
}
//Check with simple example
val x = parseCSV(Array("1,2,3","a,b,c").iterator)
val result = linesRdd.mapPartitions(parseCSV)
result.take(1)
//Array[Array[String]] = Array(Array(20, " NYC", " 2014-01-01"))
Loading CSV - Example Efficient
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
Loading + Saving Data
Tab Separated Files
Similar to CSV, but with a tab delimiter:
val parser = new CSVParser('\t')
Loading + Saving Data
SequenceFiles
● Popular Hadoop format
○ For handling many small files
○ Creates InputSplits without too much data transfer
● Composed of flat files with key/value pairs
● Have sync markers
○ Allow a reader to seek to a point
○ Then resynchronize with the record boundaries
○ Allows Spark to read efficiently in parallel from multiple nodes
Loading + Saving Data
Loading SequenceFiles
val data = sc.sequenceFile(inFile,
classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.IntWritable])
data.map(func)
…
data.saveAsSequenceFile(outputFile)
Loading + Saving Data
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)))
rdd.saveAsSequenceFile("pysequencefile1")
Saving SequenceFiles - Example
Loading + Saving Data
Loading SequenceFiles - Example
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile(
"pysequencefile1",
classOf[Text], classOf[DoubleWritable])
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect()
//Array((key1,1.0), (key2,2.0), (key3,3.0))
Loading + Saving Data
● Simple wrapper around SequenceFiles
● Values are written out using Java Serialization.
● Intended to be used for Spark jobs communicating with other
Spark jobs
● Can also be quite slow.
Object Files
Loading + Saving Data
Object Files
● Saving - saveAsObjectFile() on an RDD
● Loading - objectFile() on SparkContext
● Require almost no work to save almost arbitrary objects.
● Not available in Python; pickle files are used instead
● If you change the objects, old files may not be valid (see the round-trip sketch below)
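A minimal round-trip sketch (the output path is hypothetical):

val pairs = sc.parallelize(Seq(("key1", 1.0), ("key2", 2.0)))
pairs.saveAsObjectFile("objectfile1")               // values written via Java serialization

val loaded = sc.objectFile[(String, Double)]("objectfile1")
loaded.collect().foreach(println)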
Loading + Saving Data
Pickle File
● Python way of handling object files
● Uses Python’s pickle serialization library
● Saving - saveAsPickleFile() on an RDD
● Loading - pickleFile() on SparkContext
● Can also be quite slow, like object files
Loading + Saving Data
● Access Hadoop-supported storage formats
● Many key/value stores provide Hadoop input formats
● Example providers: HBase, MongoDB
● Older: hadoopFile() / saveAsHadoopFile()
● Newer: newAPIHadoopDataset() / saveAsNewAPIHadoopDataset()
● Takes a Configuration object on which you set the Hadoop properties
Non-filesystem data sources - hadoopFile
Loading + Saving Data
Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file
system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is
the same as for sc.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a
Configuration in Java.
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
Hadoop Input and Output Formats - Old API
hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None,
valueConverter=None, conf=None, batchSize=0)
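The signature above is the PySpark form; here is a minimal Scala sketch of the same old-API call (the path is hypothetical), reading plain text through the mapred TextInputFormat:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val oldApiRdd = sc.hadoopFile(
  "/data/spark/temps.csv",
  classOf[TextInputFormat],
  classOf[LongWritable],   // key: byte offset of each line
  classOf[Text])           // value: the line itself
oldApiRdd.map { case (_, line) => line.toString }.take(3).foreach(println)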
Loading + Saving Data
Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local
file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism
is the same as for sc.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a
Configuration in Java
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g.
“org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
Hadoop Input and Output Formats - New API
newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None,
valueConverter=None, conf=None, batchSize=0)
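Again, the signature above is the PySpark form; a minimal Scala sketch of the new-API call (path hypothetical), passing an explicit Hadoop Configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration()                      // set Hadoop properties here if needed
val newApiRdd = sc.newAPIHadoopFile(
  "/data/spark/temps.csv",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
newApiRdd.map { case (_, line) => line.toString }.take(3).foreach(println)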
Loading + Saving Data
See https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://localhost:27017/marketdata.minbars"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"
# read the 1-minute bars from MongoDB into Spark RDD format
minBarRawRDD = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName,
valueClassName, None, None, config)
Hadoop Input and Output Formats - New API
Loading Data from mongodb
Loading + Saving Data
● Developed at Google for internal RPCs
● Open sourced
● Structured data - fields & types of fields defined
● Fast to encode and decode (20-100x faster than XML)
● Takes up less space (3-10x smaller than XML)
● Defined using a domain-specific language
● Compiler generates accessor methods in a variety of languages
● Consist of fields: optional, required, or repeated
● While parsing
○ A missing optional field => success
○ A missing required field => failure
● So, make new fields optional (recall how object files break when classes change)
Protocol buffers
Loading + Saving Data
Protocol buffers - Example
package tutorial;
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}
repeated PhoneNumber phone = 4;
}
message AddressBook {
repeated Person person = 1;
}
Loading + Saving Data
1. Download and install protocol buffer compiler
2. pip install protobuf
3. protoc -I=$SRC_DIR --python_out=$DST_DIR
$SRC_DIR/addressbook.proto
4. Create objects
5. Convert those into protocol buffers (see the sketch below)
6. See this project
Protocol buffers - Steps
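The steps above target Python; since the rest of this deck's code is Scala, here is a hedged sketch of steps 4-5 using the Java classes that protoc --java_out would generate from addressbook.proto (the tutorial.Addressbook outer class name is an assumption, and the protobuf-java jar must be on the classpath):

import tutorial.Addressbook.Person                  // assumed generated class

val person = Person.newBuilder()
  .setName("John")
  .setId(1234)
  .setEmail("john@example.com")
  .build()

val bytes = person.toByteArray                      // encode to the compact binary format
val restored = Person.parseFrom(bytes)              // decode back into an object
println(restored.getName + " " + restored.getId)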
Loading + Saving Data
1. To save storage & network overhead
2. With most Hadoop output formats, we can specify compression codecs (see the sketch below)
3. Compression should not require the whole file at once
4. Each worker can find the start of a record => splittable
5. You can configure HDP for LZO using Ambari:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari
_reference_guide/content/_configure_core-sitexml_for_lzo.html
File Compression
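A minimal sketch of item 2 (the output path is hypothetical): saveAsTextFile accepts a Hadoop compression codec class.

import org.apache.hadoop.io.compress.GzipCodec

val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.saveAsTextFile("compressedOutput", classOf[GzipCodec])  // writes gzip-compressed part files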
Loading + Saving Data
File Compression Options
Characteristics of compression:
● Splittable
● Speed
● Effectiveness on Text
● Hadoop compression codec
Loading + Saving Data
Format | Splittable | Speed | Effectiveness on text | Hadoop compression codec | Comments
gzip | N | Fast | High | org.apache.hadoop.io.compress.GzipCodec |
lzo | Y | V. Fast | Medium | com.hadoop.compression.lzo.LzoCodec | LZO requires installation on every worker node
bzip2 | Y | Slow | V. High | org.apache.hadoop.io.compress.BZip2Codec | Uses pure Java for splittable version
zlib | N | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Default compression codec for Hadoop
Snappy | N | V. Fast | Low | org.apache.hadoop.io.compress.SnappyCodec | There is a pure Java port of Snappy but it is not yet available in Spark/Hadoop
File Compression Options
Loading + Saving Data
1. Enable LZO in Hadoop by updating the Hadoop configuration
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_r
eference_guide/content/_configure_core-sitexml_for_lzo.html
2. Create data:
$ bzip2 --stdout file.bz2 | lzop -o file.lzo
3. Update spark-env.sh with
export SPARK_CLASSPATH=$SPARK_CLASSPATH:hadoop-lzo-0.4.20-SNAPSHOT.jar
4. In your code, use:
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
Ref: https://gist.github.com/zedar/c43cbc7ff7f98abee885
Handling LZO
Loading + Saving Data
Loading + Saving Data: File Systems
Loading + Saving Data
1. rdd = sc.textFile("file:///home/holden/happypandas.gz")
2. The path has to be available on all nodes.
Otherwise, load it locally and distribute using sc.parallelize (see the sketch below)
Local/“Regular” FS
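A minimal sketch (the path is hypothetical): when a file exists only on the driver machine, read it locally and distribute it with sc.parallelize.

import scala.io.Source

val localLines = Source.fromFile("/home/holden/happypandas.txt").getLines().toList
val rdd = sc.parallelize(localLines)                // distribute the driver-local data
println(rdd.count())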
Loading + Saving Data
1. Popular option
2. Good if nodes are inside EC2
3. Use path in all input methods (textFile, hadoopFile etc)
s3n://bucket/path-within-bucket
4. Set env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (or set the keys via the Hadoop configuration, as in the sketch below)
More details: https://cloudxlab.com/blog/access-s3-files-spark/
Amazon S3
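A minimal sketch (bucket, path, and credentials are hypothetical): the keys can also be supplied through the Hadoop configuration instead of environment variables.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val logs = sc.textFile("s3n://my-bucket/path-within-bucket/data.txt")
println(logs.count())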
Loading + Saving Data
1. The Hadoop Distributed File System
2. Spark and HDFS can be collocated on the same machines
3. Spark can take advantage of this data locality to avoid network overhead
4. In all i/o methods, use path: hdfs://master:port/path
5. Use a Spark build that matches your HDFS version
HDFS
Thank you!
Spark - Loading + Saving Data
More Related Content

What's hot

Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
Tudor Lapusan
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
Hortonworks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
sparkInstructor
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
Cheng Lian
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Spark cassandra integration 2016
Spark cassandra integration 2016Spark cassandra integration 2016
Spark cassandra integration 2016
Duyhai Doan
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 

What's hot (19)

Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Spark cassandra integration 2016
Spark cassandra integration 2016Spark cassandra integration 2016
Spark cassandra integration 2016
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 

Similar to Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | CloudxLab

Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

Lukasz Byczynski
 
Learning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your DataLearning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your Data
phanleson
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
Svetlin Nakov
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
Scott Leberknight
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
Cloudera, Inc.
 
Hands-On Apache Spark
Hands-On Apache SparkHands-On Apache Spark
Hands-On Apache Spark
Duchess France
 
Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5
The HDF-EOS Tools and Information Center
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
phanleson
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Péter Király
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
Kyle Burke
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
Sudar Muthu
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
gvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
gvernik
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
Spark Summit
 

Similar to Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | CloudxLab (20)

Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

 
Learning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your DataLearning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your Data
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
 
Hands-On Apache Spark
Hands-On Apache SparkHands-On Apache Spark
Hands-On Apache Spark
 
Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
 

More from CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
CloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
CloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
CloudxLab
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
CloudxLab
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

More from CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
 

Recently uploaded

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Loading + Saving Data 1. So far we a. either converted in-memory data b. Or used the HDFS file Loading & Saving Data
  • 3. Loading + Saving Data 1. So far we a. either converted in-memory data b. Or used the HDFS file 2. Spark supports wide variety of datasets 3. Can access data through InputFormat & OutputFormat a. The interfaces used by Hadoop b. Available for many common file formats and storage systems (e.g., S3, HDFS, Cassandra, HBase, etc.). Loading & Saving Data
  • 4. Loading + Saving Data Common Data Sources File formats Stores
  • 5. Loading + Saving Data Common Data Sources File formats Stores ● Text, JSON, SequenceFiles, Protocol buffers. ● We can also configure compression
  • 6. Loading + Saving Data Common Data Sources File formats Stores ● Text, JSON, SequenceFiles, Protocol buffers. ● We can also configure compression Filesystems ● Local, NFS, HDFS, Amazon S3 Databases and key/value stores ● For Cassandra, HBase, Elasticsearch, and JDBC databases.
  • 7. Loading + Saving Data Structured data sources through Spark SQL aka Data Frames + Efficient API for structured data sources, including JSON and Apache Hive + Covered later Loading & Saving Data
  • 8. Loading + Saving Data Common supported file formats + A file could be in any format + If we know upfront, we can read it and load it + Else we can use “file” command like tool
  • 9. Loading + Saving Data Common supported file formats + Very common. Plain old text files. Printable chars + Records are assumed to be one per line. + Unstructured Data Text files Example:
  • 10. Loading + Saving Data Common supported file formats + Javascript Object Notation + Common text-based format + Semistructured; most libraries require one record per line. JSON files { "name" : "John", "age" : 31, "knows" : [“C”, “C++”] } Example:
  • 11. Loading + Saving Data + Very common text-based format + Often used with spreadsheet applications. + Comma separated Values Common supported file formats CSV files Example:
  • 12. Loading + Saving Data + Compact Hadoop file format used for key/value data. + Key and values can be binary data + To bundle together many small files Common supported file formats Sequence files See More at https://wiki.apache.org/hadoop/SequenceFile
  • 13. Loading + Saving Data + A fast, space-efficient multilanguage format. + More compact than JSON. Common supported file formats Protocol buffers See More at https://developers.google.com/protocol-buffers/ message Person { required string name = 1; required int32 id = 2; optional string email = 3; }
  • 14. Loading + Saving Data + For data from a Spark job to be consumed by another + Breaks if you change your classes - Java Serialization. Common supported file formats Object Files
  • 15. Loading + Saving Data Loading Files var input = sc.textFile("/data/ml-100k/u1.test") Handling Text Files - scala
  • 16. Loading + Saving Data Loading Files var input = sc.textFile("/data/ml-100k/u1.test") Loading Directories var input = sc.wholeTextFiles("/data/ml-100k"); var lengths = input.mapValues(x => x.length); lengths.collect(); [(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/mku.sh', 643), (u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.data', 1979173), (u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.genre', 202), (u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.info', 36) …] Handling Text Files - scala
  • 17. Loading + Saving Data Handling Text Files - scala Loading Files var input = sc.textFile("/data/ml-100k/u1.test") Loading Directories var input = sc.wholeTextFiles("/data/ml-100k"); var lengths = input.mapValues(x => x.length); lengths.collect(); [(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/mku.sh', 643), (u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.data', 1979173), (u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.genre', 202), (u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.info', 36) …] Saving Files lengths.saveAsTextFile(outputDir)
  • 18. Loading + Saving Data 1. Records are stored one per line, 2. Fixed number of fields per line 3. Fields are separated by a comma (tab in TSV) 4. We get row number to detect header etc. Comma / Tab -Separated Values (CSV / TSV)
  • 19. Loading + Saving Data Data: /data/spark/temps.csv Loading CSV - Sample Data 20, NYC, 2014-01-01 20, NYC, 2015-01-01 21, NYC, 2014-01-02 23, BLR, 2012-01-01 25, SEATLE, 2016-01-01 21, CHICAGO, 2013-01-05 24, NYC, 2016-5-05
  • 20. Loading + Saving Data Loading CSV - Simple Approach Array( Array(20, " NYC", " 2014-01-01"), Array(20, " NYC", " 2015-01-01"), Array(21, " NYC", " 2014-01-02"), Array(23, " BLR", " 2012-01-01"), Array(25, " SEATLE", " 2016-01-01"), Array(21, " CHICAGO", " 2013-01-05"), Array(24, " NYC", " 2016-5-05") ) var lines = sc.textFile("/data/spark/temps.csv"); var recordsRDD = lines.map(line => line.split(",")); recordsRDD.take(10);
  • 21. Loading + Saving Data import au.com.bytecode.opencsv.CSVParser var a = sc.textFile("/data/spark/temps.csv"); var p = a.map( line => { val parser = new CSVParser(',') parser.parseLine(line) }) p.take(1) //Array(Array(20, " NYC", " 2014-01-01")) spark-shell --packages net.sf.opencsv:opencsv:2.3 Or Add this to sbt: libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3" Loading CSV - Example https://gist.github.com/girisandeep/b721cf93981c338665c328441d419253
  • 22. Loading + Saving Data Loading CSV - Example Efficient https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
  • 23. Loading + Saving Data import au.com.bytecode.opencsv.CSVParser var linesRdd = sc.textFile("/data/spark/temps.csv"); Loading CSV - Example Efficient https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
  • 24. Loading + Saving Data import au.com.bytecode.opencsv.CSVParser var linesRdd = sc.textFile("/data/spark/temps.csv"); def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = { val parser = new CSVParser(',') for(line <- itr) yield parser.parseLine(line) } Loading CSV - Example Efficient https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
  • 25. Loading + Saving Data import au.com.bytecode.opencsv.CSVParser var linesRdd = sc.textFile("/data/spark/temps.csv"); def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = { val parser = new CSVParser(',') for(line <- itr) yield parser.parseLine(line) } //Check with simple example val x = parseCSV(Array("1,2,3","a,b,c").iterator) val result = linesRdd.mapPartitions(parseCSV) Loading CSV - Example Efficient https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
  • 26. Loading + Saving Data import au.com.bytecode.opencsv.CSVParser var linesRdd = sc.textFile("/data/spark/temps.csv"); def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = { val parser = new CSVParser(',') for(line <- itr) yield parser.parseLine(line) } //Check with simple example val x = parseCSV(Array("1,2,3","a,b,c").iterator) val result = linesRdd.mapPartitions(parseCSV) result.take(1) //Array[Array[String]] = Array(Array(20, " NYC", " 2014-01-01")) Loading CSV - Example Efficient https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
  • 27. Loading + Saving Data Tab Separated Files Similar to csv: val parser = new CSVParser('t')
  • 28. Loading + Saving Data ● Popular Hadoop format ○ For handling small files ○ Create InputSplits without too much transport SequenceFiles
  • 29. Loading + Saving Data SequenceFiles ● Popular Hadoop format ○ For handling small files ○ Create InputSplits without too much transport ● Composed of flat files with key/value pairs. ● Has Sync markers ○ Allow to seek to a point ○ Then resynchronize with the record boundaries ○ Allows Spark to efficiently read in parallel from multiple nodes
  • 30. Loading + Saving Data Loading SequenceFiles val data = sc.sequenceFile(inFile, "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable") data.map(func) … data.saveAsSequenceFile(outputFile)
  • 31. Loading + Saving Data var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0))) rdd.saveAsSequenceFile("pysequencefile1") Saving SequenceFiles - Example
  • 33. Loading + Saving Data import org.apache.hadoop.io.DoubleWritable import org.apache.hadoop.io.Text Loading SequenceFiles - Example
  • 34. Loading + Saving Data import org.apache.hadoop.io.DoubleWritable import org.apache.hadoop.io.Text val myrdd = sc.sequenceFile( "pysequencefile1", classOf[Text], classOf[DoubleWritable]) Loading SequenceFiles - Example
  • 35. Loading + Saving Data Loading SequenceFiles - Example
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile("pysequencefile1", classOf[Text], classOf[DoubleWritable])
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect() //Array((key1,1.0), (key2,2.0), (key3,3.0))
  • 36. Loading + Saving Data ● Simple wrapper around SequenceFiles ● Values are written out using Java Serialization. ● Intended to be used for Spark jobs communicating with other Spark jobs ● Can also be quite slow. Object Files
  • 37. Loading + Saving Data Object Files ● Saving - saveAsObjectFile() on an RDD ● Loading - objectFile() on SparkContext ● Require almost no work to save almost arbitrary objects ● Not available in Python; pickle files are used instead ● If you change your classes, old files may no longer be readable
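A minimal sketch of the two calls, with hypothetical data and output path:

// Save any serializable objects; values are written with Java serialization
val people = sc.parallelize(Seq(("Alice", 29), ("Bob", 35)))
people.saveAsObjectFile("/tmp/people-objects")

// Load them back; the caller supplies the element type
val restored = sc.objectFile[(String, Int)]("/tmp/people-objects")
restored.collect()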
  • 38. Loading + Saving Data Pickle File ● Python way of handling object files ● Uses Python's pickle serialization library ● Saving - saveAsPickleFile() on an RDD ● Loading - pickleFile() on SparkContext ● Can also be quite slow, like object files
  • 39. Loading + Saving Data ● Access Hadoop-supported storage formats ● Many key/value stores provide Hadoop input formats ● Example providers: HBase, MongoDB ● Older API: hadoopFile() / saveAsHadoopFile() ● Newer API: newAPIHadoopRDD() / saveAsNewAPIHadoopDataset() ● These take a Configuration object on which you set the Hadoop properties Non-filesystem data sources - hadoopFile
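A minimal Scala sketch of the new-API read path, using the plain TextInputFormat only to show the call shape (the path is the sample CSV from earlier):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// New API: pass the InputFormat class first, then the key and value Writable classes
val kv = sc.newAPIHadoopFile(
  "/data/spark/temps.csv",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Convert Writables to plain Scala types before using the records
val lines = kv.map { case (offset, text) => text.toString }
lines.take(2)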
  • 40. Loading + Saving Data Hadoop Input and Output Formats - Old API
hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict, which is converted into a Configuration in Java.
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapred.TextInputFormat")
keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – the number of Python objects represented as a single Java object (default 0: choose batchSize automatically)
  • 41. Loading + Saving Data Hadoop Input and Output Formats - New API
newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict, which is converted into a Configuration in Java.
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – the number of Python objects represented as a single Java object (default 0: choose batchSize automatically)
  • 42. Loading + Saving Data Hadoop Input and Output Formats - New API
Loading data from MongoDB
# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://localhost:27017/marketdata.minbars"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"
# read the 1-minute bars from MongoDB into Spark RDD format
minBarRawRDD = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName, None, None, config)
See https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
  • 43. Loading + Saving Data ● Developed at Google for internal RPCs ● Open sourced ● Structured data - fields & types of fields are defined ● Fast to encode and decode (20-100x faster than XML) ● Takes up less space (3-10x smaller than XML) ● Defined using a domain-specific language ● Compiler generates accessor methods in a variety of languages ● Consist of fields: optional, required, or repeated ● While parsing ○ A missing optional field => success ○ A missing required field => failure ● So make new fields optional (remember the object-file breakage?) Protocol buffers
  • 44. Loading + Saving Data Protocol buffers - Example
package tutorial;
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }
  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }
  repeated PhoneNumber phone = 4;
}
message AddressBook {
  repeated Person person = 1;
}
  • 45. Loading + Saving Data 1. Download and install protocol buffer compiler 2. pip install protobuf 3. protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto 4. create objects 5. Convert those into protocol buffers 6. See this project Protocol buffers - Steps
  • 46. Loading + Saving Data File Compression 1. To save storage & network overhead 2. With most Hadoop output formats we can specify compression codecs (see the sketch below) 3. Compression should not require the whole file at once 4. Each worker can find the start of a record => splittable 5. You can configure HDP for LZO using Ambari: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/_configure_core-sitexml_for_lzo.html
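A minimal sketch of point 2, compressing text output with gzip (the output path is hypothetical):

import org.apache.hadoop.io.compress.GzipCodec

// Pass the codec class to saveAsTextFile; each output part file is compressed
val temps = sc.textFile("/data/spark/temps.csv")
temps.saveAsTextFile("/data/spark/temps-gz", classOf[GzipCodec])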
  • 47. Loading + Saving Data File Compression Options Characteristics to compare: ● Splittable ● Speed ● Effectiveness on text ● Hadoop compression codec
  • 48. Loading + Saving Data File Compression Options
Format  | Splittable | Speed   | Effectiveness on text | Hadoop compression codec                   | Comments
gzip    | N          | Fast    | High                  | org.apache.hadoop.io.compress.GzipCodec    |
lzo     | Y          | V. Fast | Medium                | com.hadoop.compression.lzo.LzoCodec        | LZO requires installation on every worker node
bzip2   | Y          | Slow    | V. High               | org.apache.hadoop.io.compress.BZip2Codec   | Uses pure Java for the splittable version
zlib    | N          | Slow    | Medium                | org.apache.hadoop.io.compress.DefaultCodec | Default compression codec for Hadoop
Snappy  | N          | V. Fast | Low                   | org.apache.hadoop.io.compress.SnappyCodec  | There is a pure Java port of Snappy, but it is not yet available in Spark/Hadoop
  • 49. Loading + Saving Data Handling LZO
1. Enable LZO in Hadoop by updating the Hadoop configuration: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/_configure_core-sitexml_for_lzo.html
2. Create data: $ bzip2 --stdout file.bz2 | lzop -o file.lzo
3. Update spark-env.sh with: export SPARK_CLASSPATH=$SPARK_CLASSPATH:hadoop-lzo-0.4.20-SNAPSHOT.jar
In your code, use:
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
Ref: https://gist.github.com/zedar/c43cbc7ff7f98abee885
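A minimal sketch of reading LZO-compressed text, assuming the hadoop-lzo jar is on the classpath (the LzoTextInputFormat class name comes from the hadoop-lzo project; the path is hypothetical):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Reads .lzo files; if an .lzo.index file exists, splits line up with LZO block boundaries
val lzoLines = sc.newAPIHadoopFile(
  "/data/spark/file.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]).map { case (_, text) => text.toString }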
  • 50. Loading + Saving Data Loading + Saving Data: File Systems
  • 51. Loading + Saving Data 1. rdd = sc.textFile("file:///home/holden/happypandas.gz") 2. The path has to be available on all nodes. Otherwise, load it locally and distribute using sc.parallelize Local/“Regular” FS
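A minimal sketch of the fallback in point 2, assuming a hypothetical uncompressed local file that only the driver can see:

import scala.io.Source

// Read the file once on the driver, then distribute the lines as an RDD
val localLines = Source.fromFile("/home/holden/happypandas.txt").getLines().toList
val rdd = sc.parallelize(localLines)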
  • 52. Loading + Saving Data 1. Popular option 2. Good if nodes are inside EC2 3. Use path in all input methods (textFile, hadoopFile etc) s3n://bucket/path-within-bucket 4. Set Env. Vars: AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY More details: https://cloudxlab.com/blog/access-s3-files-spark/ Amazon S3
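A minimal sketch, using the bucket path from the slide and the classic s3n credential properties (this assumes the older s3n connector and that the two environment variables above are set):

// Copy the credentials from the environment variables into the Hadoop configuration
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val pandas = sc.textFile("s3n://bucket/path-within-bucket")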
  • 53. Loading + Saving Data HDFS 1. The Hadoop Distributed File System 2. Spark and HDFS can be collocated on the same machines 3. Spark can take advantage of this data locality to avoid network overhead 4. In all I/O methods, use a path of the form: hdfs://master:port/path 5. Use a Spark build that matches your HDFS version
  • 54. Thank you! Spark - Loading + Saving Data