
AWS EMR + Spark ML


- Spark basics (based on PySpark)
- AWS EMR, YARN client mode
- Spark ML, Distributed ML Algorithm



  1. AWS EMR + Spark ML for Kagglers
  2. Agenda: Apache Spark and Zeppelin; Spark standalone mode; Spark YARN cluster mode; Spark SQL and DataFrame; Spark ML and MLlib; data parallelism vs. compute parallelism; online learning on Spark; AWS Elastic MapReduce; distributed computing; AWS EMR + S3 architecture; data partitioning and skew
  3. Why would ML need Spark at all? Kagglers mostly work with packages such as scikit-learn, XGBoost, and pandas, and with TensorFlow, Keras, or PyTorch for deep learning. The catch is that Kaggle datasets top out at around 5 million rows. And in the real world? At least 5 million rows of cleaned features arrive per day. Could you train a model over a year or more of that data? How long would tuning the model's parameters take? And what if the model has to serve a live service in real time?
  4. You hit an Out of Memory error. With enough elbow grease (resampling, pandas chunks, map, concat, and so on) you can usually manage to read the data somehow, but getting through preprocessing as well takes forever... (see the sketch below)
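     Where pandas pulls the whole file into one process's memory at once, Spark only records a lazy plan and spreads the work across executors. A minimal sketch of that contrast, assuming a hypothetical features.csv:

         from pyspark.sql import SparkSession

         # pd.read_csv("features.csv") would load everything into driver
         # memory at once -- this is where the OOM above happens.

         spark = SparkSession.builder.appName("oom-demo").getOrCreate()

         # Spark records the read as a lazy plan; executors do the actual
         # work only when an action (count, collect, write) is called.
         df = spark.read.csv("features.csv", header=True, inferSchema=True)
         print(df.count())  # the action that triggers the distributed read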
  5. Apache Spark (https://spark.apache.org/): "Spark is a fast and general engine for large-scale data processing." Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
  6. Apache Spark supports easy, fast in-memory computing; provides high-level APIs for Scala, Python, and R; supports general-purpose ANSI SQL (Spark SQL); and enables complex analytics through packages such as Spark Streaming, ML, and GraphX. (A small taste of the API follows below.)
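     A minimal sketch of the high-level API and Spark SQL side by side; the toy rows are made up for illustration:

         from pyspark.sql import SparkSession

         spark = SparkSession.builder.appName("api-demo").getOrCreate()
         df = spark.createDataFrame([("a", 1), ("b", 2), ("b", 3)],
                                    ["key", "value"])

         # The same aggregation through the DataFrame API...
         df.groupBy("key").sum("value").show()

         # ...and through ANSI SQL via Spark SQL.
         df.createOrReplaceTempView("t")
         spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").show()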
  7. Runs everywhere (Hadoop, Mesos, standalone, cloud); accesses diverse data sources (HDFS, Cassandra, HBase, and S3)
  8. Spark process
  9. Now let's try it hands-on in standalone mode. The easy way is the Docker image (Spark 2.2, Python 3.x): https://hub.docker.com/r/jupyter/pyspark-notebook/

         $ docker pull jupyter/pyspark-notebook
         $ docker run -it -p 8888:8888 jupyter/pyspark-notebook

         # To share a local notebook folder, add a mount option:
         #   -v /MyPath:/home/jovyan/notebook
         # To set a password, use the start-notebook.sh script:
         #   start-notebook.sh --NotebookApp.password='sha1:blabla'
  10. The shift from RDD to DataFrame and Dataset (a conversion sketch follows below)
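     A minimal sketch of crossing between the two worlds; the rows are made up. Note that the typed Dataset API is Scala/Java only, so in PySpark the DataFrame is the workhorse:

         from pyspark.sql import Row, SparkSession

         spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
         rdd = spark.sparkContext.parallelize(
             [Row(name="a", score=1.0), Row(name="b", score=2.0)])

         # toDF() attaches a schema, which is what lets the Catalyst
         # optimizer and the Tungsten execution engine do their work.
         df = rdd.toDF()
         df.printSchema()

         # The underlying RDD is still reachable when you need it.
         print(df.rdd.take(2))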
  11. RDD lineage, actions, transformations: RDDs track the series of transformations used to build them to recompute lost data (fault tolerance)
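     A small sketch of the lineage idea: transformations are lazy and only extend the recorded graph, while an action triggers actual execution (the numbers are arbitrary):

         from pyspark.sql import SparkSession

         spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
         rdd = spark.sparkContext.parallelize(range(10))

         # Transformations: nothing runs yet, the lineage just grows.
         doubled = rdd.map(lambda x: x * 2)
         evens = doubled.filter(lambda x: x % 4 == 0)

         # The recorded lineage Spark would replay to rebuild lost
         # partitions (PySpark returns it as bytes, hence the decode).
         print(evens.toDebugString().decode("utf-8"))

         # An action finally executes the whole chain.
         print(evens.collect())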
  12. Spark SQL, DAGScheduler
  13. PySpark SQL API: the official documentation is the best resource (make sure it matches the version you are running):
         http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
         https://spark.apache.org/docs/latest/sql-programming-guide.html
      Spark properties: a wide range of settings for operating the cluster and running applications; options such as driver/executor memory and cores matter most (a sketch follows below):
         https://spark.apache.org/docs/latest/configuration.html
         https://www.slideshare.net/JunyoungPark22/spark-config
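     A sketch of setting such properties from code; the values are illustrative, not tuning advice. Driver-side settings such as spark.driver.memory generally must be set before the JVM starts (via spark-submit or spark-defaults.conf), so only executor-side options are shown here:

         from pyspark.sql import SparkSession

         spark = (SparkSession.builder
                  .appName("config-demo")
                  .config("spark.executor.memory", "4g")
                  .config("spark.executor.cores", "2")
                  .getOrCreate())

         # Inspect what the running application actually resolved.
         for key, value in sorted(spark.sparkContext.getConf().getAll()):
             print(key, "=", value)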
  14. Now, let's build a cluster. Apache Hadoop, Spark, HBase, Hive, Flink... far too many to install one by one (and managing them all is even harder). How do you handle each framework's dependencies? The constant pressure to upgrade versions... network failures and other unexpected outages... auto scaling, provisioning...
  15. AWS EMR (https://aws.amazon.com/ko/emr/): Amazon EMR is a managed Hadoop framework that processes vast amounts of data easily, quickly, and cost-effectively across dynamically scalable Amazon EC2 instances. It runs popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink, and interacts with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. Amazon EMR securely and reliably handles a broad range of use cases, including log analysis, web indexing, data transformation (ETL), and machine learning.
  16. AWS EMR: how does it handle the dependency problem? Apache Bigtop. Auto scaling and provisioning are supported; check the release history: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html. Note that EMR instances have no start/restart: shutting a cluster down always means terminate, and its data is gone for good, so storing data in S3 is recommended.
  17. Master node: manages the cluster, runs MapReduce and schedules jobs across the subordinate groups, tracks the status of each running job, and monitors the health of the instance groups; analogous to a Hadoop master node (exactly one exists). Core node: runs tasks and stores data in HDFS; analogous to a Hadoop slave node. Task node: runs tasks only, also analogous to a Hadoop slave node; used when the workload is compute-heavy and you want to scale out only CPU and memory. (A launch sketch with these node types follows below.)
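     Since the deck mentions boto3 for automation, here is a hypothetical sketch of launching a cluster with these three node types through boto3; the release label, instance types, counts, key pair, and roles are all placeholders:

         import boto3

         emr = boto3.client("emr", region_name="us-east-1")
         response = emr.run_job_flow(
             Name="spark-ml-demo",
             ReleaseLabel="emr-5.8.0",
             Applications=[{"Name": "Spark"}],
             Instances={
                 "InstanceGroups": [
                     {"Name": "master", "InstanceRole": "MASTER",
                      "InstanceType": "m4.large", "InstanceCount": 1},
                     {"Name": "core", "InstanceRole": "CORE",
                      "InstanceType": "m4.large", "InstanceCount": 2},
                     {"Name": "task", "InstanceRole": "TASK",
                      "InstanceType": "c4.xlarge", "InstanceCount": 2},
                 ],
                 "Ec2KeyName": "my-key-pair",          # placeholder
                 "KeepJobFlowAliveWhenNoSteps": True,  # don't auto-terminate
             },
             JobFlowRole="EMR_EC2_DefaultRole",
             ServiceRole="EMR_DefaultRole",
         )
         print(response["JobFlowId"])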
  18. Hadoop (HDFS, YARN): when several computing platforms run side by side (MapReduce, Spark, Storm, ...), server resources run short and jobs that used to run fine start failing because of the other workloads.
  19. ResourceManager: manages and schedules resources across the entire cluster. NodeManager: monitors containers (health checks).
  20. Now let's try YARN cluster mode on AWS EMR, using EMR steps. Submission can be automated with aws-cli, boto3, Airflow, and so on (a boto3 version follows below).

         $ aws emr add-steps \
             --cluster-id $CLUSTERID \
             --steps Name=$JOBNAME,Jar=$JARFILE,ActionOnFailure=${ACTION_ON_FAIL},Args=[/usr/lib/spark/bin/spark-submit,--deploy-mode,cluster,--properties-file,/etc/spark/conf/spark-defaults.conf,--conf,spark.yarn.executor.memoryOverhead=2048,--conf,spark.executor.memory=4g,--packages,$SPARK_PACKAGES]
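     The same step submitted from Python with boto3, as the slide suggests; a sketch where the cluster id, bucket, and script path are placeholders, and command-runner.jar is EMR's standard wrapper for running spark-submit as a step:

         import boto3

         emr = boto3.client("emr", region_name="us-east-1")
         step = {
             "Name": "spark-ml-job",
             "ActionOnFailure": "CONTINUE",
             "HadoopJarStep": {
                 "Jar": "command-runner.jar",
                 "Args": [
                     "spark-submit",
                     "--deploy-mode", "cluster",
                     "--conf", "spark.executor.memory=4g",
                     "--conf", "spark.yarn.executor.memoryOverhead=2048",
                     "s3://my-bucket/jobs/train.py",  # placeholder script
                 ],
             },
         }
         response = emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX",
                                           Steps=[step])
         print(response["StepIds"])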
  21. Spark ML
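     To set the stage for this part, a minimal Spark ML pipeline sketch; the toy rows and column names are made up:

         from pyspark.ml import Pipeline
         from pyspark.ml.classification import LogisticRegression
         from pyspark.ml.feature import VectorAssembler
         from pyspark.sql import SparkSession

         spark = SparkSession.builder.appName("spark-ml-demo").getOrCreate()
         df = spark.createDataFrame(
             [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 1.0)],
             ["f1", "f2", "label"])

         # Spark ML composes feature transformers and an estimator into a
         # single Pipeline that trains and predicts on the cluster.
         assembler = VectorAssembler(inputCols=["f1", "f2"],
                                     outputCol="features")
         lr = LogisticRegression(featuresCol="features", labelCol="label")
         model = Pipeline(stages=[assembler, lr]).fit(df)
         model.transform(df).select("label", "prediction").show()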
