The document discusses Spark and Python for data processing. It describes Spark's features like processing large datasets, SQL-like data processing, machine learning, and supporting various file formats. It provides examples of RDD, DataFrame, and SQL in Spark. It also demonstrates local development of Spark applications with Docker and deployment to AWS EMR. Code examples show reading, writing, and analyzing data with PySpark.
1. Getting Started with Data Processing with
Spark and Python
Ridwan Fadjar
Web Developer @Ebizu
Article Writer @(Codepolitan | POSS UPI | Serverless ID | Labbasdat CS UPI)
3. Spark Features
● Large dataset processing
● Data processing with SQL-like syntax
● Graph processing
● Machine learning on top of Spark
● Ingesting data streams from Kafka or Kinesis
● Support for programming languages such as Java, Scala, Python, and R
● Supported file formats include CSV, ORC, Parquet, Text, and others
8. Local Development
● Install Docker on your laptop
● Pull the Spark container image built by singularities
● Start the containers
● Write a sample script and store it inside the container
● Run it with spark-submit
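The steps above might look like the following on the command line. This is a dry-run sketch that only prints the commands instead of executing them; the image name and container command come from the community `singularities/spark` image on Docker Hub and are assumptions, not verified specifics.

```shell
# Dry-run sketch: echo the commands rather than running them,
# since Docker may not be available everywhere.
IMAGE="singularities/spark"   # community image (assumption)
PULL_CMD="docker pull $IMAGE"
RUN_CMD="docker run -d --name spark-master $IMAGE start-spark master"
SUBMIT_CMD="docker exec spark-master spark-submit --master local /tmp/data/script.py"
echo "$PULL_CMD"
echo "$RUN_CMD"
echo "$SUBMIT_CMD"
```

In a real session you would run these commands directly, copy your script into the container (for example with `docker cp`), and then invoke `spark-submit` inside it.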
9. Local Development (1)
● Example command 1: spark-submit --deploy-mode client --master local script.py
● Example command 2: spark-submit --deploy-mode client --master local[*] script.py
● Example command 3: spark-submit --deploy-mode cluster --master yarn script.py
● Example command 4: spark-submit --deploy-mode cluster --master yarn --py-files
config.py script.py
● And so on
10. Local Development (3)
● A Spark application should accept parameters so a single script can handle
different cases dynamically
● It always has an input dataset and an output dataset
● It can run on a single node (master only) or with one worker
● Use pip to install the dependencies you need
● Unit test the functions and libraries you write yourself
● Make sure every required library is installed on both the master and the
workers
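A minimal skeleton for the parameterized script described above, using the standard library's argparse. The argument names and paths are illustrative; the Spark job body itself is left as hedged comments, since it depends on your cluster setup.

```python
import argparse

def parse_args(argv=None):
    """Parse input/output dataset paths so one script can serve many cases."""
    parser = argparse.ArgumentParser(description="Parameterized Spark job")
    parser.add_argument("--input", required=True, help="input dataset path")
    parser.add_argument("--output", required=True, help="output dataset path")
    parser.add_argument("--log-level", default="ERROR", help="Spark log level")
    return parser.parse_args(argv)

# Example invocation (paths are placeholders):
args = parse_args(["--input", "/data/users.csv", "--output", "/data/out"])

# With PySpark installed on master and workers, the job body would go here:
# sc = SparkContext(appName="ParameterizedJob")
# sc.setLogLevel(args.log_level)
# sc.textFile(args.input)...saveAsTextFile(args.output)
```

Because `parse_args` is a plain function, it is also easy to unit test, in line with the testing tip above.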
11. Deployment
● Store the source code on GitHub
● Use Ansible to deploy the Spark application and manage its
configuration
● Use Ansible to manage the configuration inside each node
12. Deployment on AWS
● Run scripts directly on AWS Elastic MapReduce
● Use AWS EMR Steps and Clusters through the AWS Console
● Use AWS EMR Steps and Clusters through the AWS CLI, scheduled
by cron
● Use AWS EMR Steps and Clusters through the AWS CLI, scheduled
by a scheduler such as Luigi, Apache Oozie, or Apache Airflow
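As a sketch, an EMR step that runs a PySpark script can be described as a dict like the one below and then submitted with the AWS CLI (`aws emr add-steps`) or boto3's `add_job_flow_steps`. The bucket path and cluster id are placeholders, and the boto3 call is shown only in comments since it needs credentials and a live cluster.

```python
def pyspark_emr_step(name, script_path, extra_args=None):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", "cluster", script_path]
    if extra_args:
        args += extra_args
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's built-in command runner
            "Args": args,
        },
    }

step = pyspark_emr_step("daily-job", "s3://my-bucket/script.py")
# With boto3 installed and credentials configured (not executed here):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```

A scheduler such as Airflow or Luigi would then call this submission as a task, as the slide suggests.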
13. Integrating Spark with Other Solutions
● MySQL
● Kafka
● Elasticsearch
● Redis
● MemSQL
● AWS Kinesis
● And others
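For the MySQL case, Spark can read a table over JDBC. The helper below just builds the option dict; host, database, table, and credentials are placeholders, and the actual `spark.read` call is left as a comment because it requires a running SparkSession and the MySQL connector JAR on the classpath.

```python
def mysql_jdbc_options(host, port, database, table, user, password):
    """Build the options for spark.read.format("jdbc") against MySQL."""
    return {
        "url": "jdbc:mysql://%s:%d/%s" % (host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.jdbc.Driver",  # class name from the MySQL connector JAR
    }

opts = mysql_jdbc_options("192.168.46.49", 3306, "shop", "orders", "spark", "secret")
# With a SparkSession `spark` and the connector JAR passed via --jars:
# dframe = spark.read.format("jdbc").options(**opts).load()
# dframe.show()
```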
14. PySpark Code Example (1)
from pyspark import SparkConf, SparkContext
logFile = "/data/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
sc.setLogLevel("ERROR")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
16. PySpark Code Example (3)
from pyspark import SparkContext, SparkConf
from random import randint
# http://localhost:9200/spark/_search/?size=1000&pretty=1
# spark-submit --jars /tmp/data/elasticsearch-hadoop-5.4.0.jar /tmp/data/spark-es-write-test.py
sc = SparkContext("local", "Simple App")
sc.setLogLevel("ERROR")
es_conf = {
"es.nodes" : "192.168.46.49",
"es.port" : "9200",
"es.resource" : "spark/docs",
}
if __name__ == "__main__":
    rdd = sc.parallelize([ (i, { "x":i, "y":"lorem ipsum sit dolor amet", "z":randint(0, 1000)} ) for i in range(0, 100) ])
    rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)
17. PySpark Code Example (4)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "WordCountStreaming")
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 10)
lines = ssc.socketTextStream("10.2.2.38", 9999)
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
18. PySpark Code Example (5)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[2]", "WordCountStreaming")
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 10)
topic = "test"
lines = KafkaUtils.createStream(ssc, "10.2.2.38:2181", "topic", {topic: 4})
words = lines.flatMap(lambda line: line[1].split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
19. PySpark Code Example (6)
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from random import randint
from datetime import timedelta, datetime
sc = SparkContext()
sc.setLogLevel("ERROR")
ss = SparkSession(sc)
sqlCtx = SQLContext(sc)
dataset = sc.textFile("/data/contoso/user-*.csv").map(lambda line: line.split("|"))
for row in dataset.take(5):
    print("-->")
    print(row)
dframe = dataset.toDF()
dframe.show()
print(dframe.count())
try:
    dframe.write.partitionBy("_6").format("parquet").save("user.parquet")
except Exception:
    print("The parquet file already exists")
20. PySpark Code Example (7)
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from random import randint
from datetime import timedelta, datetime
sc = SparkContext()
sc.setLogLevel("ERROR")
ss = SparkSession(sc)
sqlCtx = SQLContext(sc)
dframe = ss.read.load("/user/spark/user.parquet")
dframe.show()
print(dframe.count())
21. PySpark Code Example (8)
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import IntegerType, TimestampType, ByteType, ShortType, StringType, DecimalType, StructField, StructType
from random import randint
from datetime import timedelta, datetime
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
# dataset = sc.textFile("/data/campaign/campaign-metadata-sample-1.csv").map(lambda line: line.split("|"))
# for row in dataset.take(5):
# print ("-->")
# print (row)
schema = StructType([
    StructField("metadata_id", StringType(), False),
    StructField("type", StringType(), True),
    StructField("event", StringType(), True),
    StructField("metadata", StringType(), True),
    StructField("application_id", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("api_version", StringType(), True)
])
dframe = spark.read.schema(schema).option("delimiter", "|").csv("/data/campaign/campaign-metadata-sample-1.csv")
dframe.show()
try:
    dframe.write.partitionBy("created_at").format("orc").save("campaign-metadata")
except Exception as e:
    print(e)
    print("The ORC file already exists")
22. Managed Spark Services
- HortonWorks
- Azure HDInsight
- Amazon Web Services Elastic MapReduce
- Cloudera Spark
- Databricks
- and others
23. Alternatives to Apache Spark
- Apache Beam
- Apache Flink
- Apache Storm
- Apache Hive
- PrestoDB
- and others
26. Special Thanks
Zaky & Wildan, who taught me Apache Spark.
Fajri & Tajhul, who taught me how to use various AWS products.
Bramandityo, who taught me Python.