Compression Options in Hadoop - A Tale of Tradeoffs (DataWorks Summit)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpora of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible compression ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
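For orientation, a minimal sketch (in Scala, against the standard Hadoop 2 MapReduce API) of how a codec choice is wired into a job; the specific picks here, Snappy for intermediate map output and bzip2 for final output, are illustrative assumptions, not the paper's recommendation.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.compress.BZip2Codec
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val conf = new Configuration()
// Compress intermediate map output with a fast codec.
conf.setBoolean("mapreduce.map.output.compress", true)
conf.set("mapreduce.map.output.compress.codec",
  "org.apache.hadoop.io.compress.SnappyCodec")

val job = Job.getInstance(conf, "compression-demo")
// Compress the final output with a high-ratio, splittable codec.
FileOutputFormat.setCompressOutput(job, true)
FileOutputFormat.setOutputCompressorClass(job, classOf[BZip2Codec])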
Cloud DW benchmark using TPC-DS (Snowflake vs Redshift vs EMR Hive) (SANG WON PARK)
Data architecture has been changing rapidly for several years now. Among those changes, the Cloud DW has drawn attention as an alternative to the limitations (performance, cost, operations, etc.) of the existing Hadoop-based Data Lake, and many companies have already adopted one or are evaluating adoption.
This deck builds a conceptual understanding of the Cloud DW and compares the various Cloud DW products on the market from a performance/cost perspective, asking which one fits a given enterprise environment.
- Why are companies paying attention to the Cloud DW?
- What products are on the market?
- Which product should we adopt for our business environment?
- How do Cloud DW solutions perform?
- How do they perform against the existing Data Lake (EMR)?
- How do comparable Cloud DWs (Snowflake vs Redshift) perform against each other?
Going forward, the market around data will rapidly grow a new ecosystem built on the Cloud DW, including ELT, Data Mesh, and Reverse ETL, and this will call for technical review and thought from the perspective of data engineers and data architects.
https://blog.naver.com/freepsw/222654809552
ScyllaDB recently launched Scylla Cloud, our database as a service, which combines the speed and power of the Scylla NoSQL database with the ease of a fully managed cloud service. Scylla Cloud relieves your team of day-to-day cluster management so you can focus on creating modern, interactive applications that respond to queries in milliseconds.
Join us for an overview of Scylla Cloud, including a live demo of how to launch and connect to a cluster, how to create and query a table, and how to run a few operations, all in minutes.
Flink vs. Spark: the slide deck from my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we compare Apache Flink and Apache Spark with a focus on real-time stream processing. Your feedback and comments are much appreciated.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersDataWorks Summit
Apache Knox Gateway is a proxy for interacting with Apache Hadoop clusters in a secure way, providing authentication, service-level authorization, and many other extensions to secure any HTTP interactions in your cluster. One key feature of Apache Knox Gateway is the ability to extend the reach of your REST APIs to the internet while still securing your cluster and working with Kerberos. Recent contributions to the Apache Knox community have added support for Single Sign-On (SSO) based on pac4j 1.8.9, a powerful security engine that provides SSO through SAML2, OAuth, OpenID, and CAS. In addition, through recent community contributions, Apache Ambari and Apache Ranger can now also authenticate via SSO through Knox. This paper discusses the architecture of Knox SSO, explains how enterprise users can benefit from this feature, and presents enterprise use cases for Knox SSO, including integration with open source Shibboleth, AD FS (Windows Server) IdP support, and the Okta cloud IdP.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk dives into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
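As a rough sketch of the V2 read path's shape (assuming the Spark 2.3.0 interfaces; every class name here is hypothetical), a toy source that serves the numbers 0 through 4 as a single column might look like this:

import java.util.{Arrays => JArrays, List => JList}
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point; Spark instantiates it via spark.read.format(...).
class SimpleSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader = new SimpleReader
}

class SimpleReader extends DataSourceReader {
  override def readSchema(): StructType = StructType(Seq(StructField("value", IntegerType)))
  // One factory per partition; a single partition here.
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    JArrays.asList[DataReaderFactory[Row]](new SimplePartition)
}

class SimplePartition extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var i = -1
    override def next(): Boolean = { i += 1; i < 5 }
    override def get(): Row = Row(i)
    override def close(): Unit = ()
  }
}

// Usage: spark.read.format(classOf[SimpleSource].getName).load().show()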
Interested in learning Hadoop, but overwhelmed by the number of components in the Hadoop ecosystem? Would you like some hands-on experience with Hadoop, but don't know Linux or Java? This session gives a high-level explanation of Hive and HiveQL and shows how you can use them to get started with Hadoop without knowing Linux or Java.
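For flavor, a hedged example of the kind of HiveQL the session has in mind (the table and columns are made up): a SQL-literate analyst can define and query data on Hadoop without touching Java.

-- Hypothetical table over data stored in the cluster.
CREATE TABLE page_views (view_time TIMESTAMP, url STRING, ip STRING);

-- Plain SQL aggregation; Hive compiles it to cluster jobs for you.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;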
Optimizing Delta/Parquet Data Lakes for Apache Spark (Databricks)
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta (a short sketch follows this entry):
- Optimal file sizes in a data lake
- File compaction to fix the small file problem
- Why Spark hates globbing S3 files
- Partitioning data lakes with partitionBy
- Parquet predicate pushdown filtering
- Limitations of Parquet data lakes (files aren't mutable!)
- Mutating Delta lakes
- Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
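A minimal sketch of two of those patterns, partitionBy and predicate pushdown, in Spark's Scala API (paths and column names are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lake-demo").getOrCreate()

// Write a Parquet lake partitioned on a low-cardinality column.
spark.read.json("s3a://example-bucket/raw/events/")
  .write
  .partitionBy("event_date")
  .mode("overwrite")
  .parquet("s3a://example-bucket/lake/events/")

// The partition filter prunes whole directories; the column filter is
// pushed down into the Parquet reader, which skips non-matching row groups.
spark.read.parquet("s3a://example-bucket/lake/events/")
  .where("event_date = '2019-01-01' AND status = 'error'")
  .show()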
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3 (DataWorks Summit)
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 in April 2018, loaded with many features and improvements. One of the biggest challenges for any major release of a software platform is compatibility. The Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are many challenges for admins to address while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session dives into the upgrade aspects in detail and gives a detailed preview of migration strategies, with information on what works and what might not. The talk covers the motivation for upgrading to Hadoop 3 and provides a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
Speakers
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
An introductory Spark deck that starts from big data concepts and works through the emergence of big data analytics platforms (Hadoop) and the background to Spark's rise.
The concept of RDDs and the Spark SQL library are covered in some detail (with brief explanations of the Tungsten engine and the Catalyst optimizer).
The deck closes with a short hands-on section on installation and interactive analysis.
The original PPT is public, so feel free to adapt and reuse it anytime, anywhere, as long as you credit the source.
Figures and materials borrowed from other slides or blogs are credited in small print, but the sources of some materials found while writing the first version of this deck are unclear. If you know a source, please let me know and I will update the credits.
This deck covers hands-on IoT: controlling a Raspberry Pi with JavaScript and building a remote control. The table of contents:
0. Before we begin (p. 4)
1. Getting the Pi running - initial setup (p. 17)
2. The Pi's operating system - a Linux crash course (p. 33)
3. Controlling hardware with JavaScript - Node.js (p. 57)
4. Measuring distance and displaying the readings - GPIO (p. 81)
5. Turning a smartphone into a remote control - jQuery Mobile (p. 119)
6. Developing audio software - music and weather from the internet - OpenAPI & RSS (p. 137)
7. Developing audio software (p. 153)
8. Shaping the enclosure as you like - 3D printing (p. 189)
9. IoT that runs anytime, anywhere - Circulus (p. 192)
10. Wrap-up (p. 256)
An overview of Spark SQL as of Spark 1.6. The slides are not especially self-explanatory, but since there was little material available in Korean, I am sharing them in case they help someone.
Rather than the slides themselves, I recommend reading the references on the last page.
Feel free to take and use this deck however you like, as long as you credit the source.
Serverless orchestration straight from code - Azure Durable Functions
Analyzing data with Zeppelin (Spark)
1. Analyzing data with Zeppelin (powered by Apache Spark)
2014-11-05
스사모 (Korea Spark User Group)
https://www.facebook.com/groups/sparkkoreauser/
Sangwoo Kim, VCNC (Between)
sangwookim.me@gmail.com
2. Apache Spark?
• Supports MapReduce-style jobs
• Extensible (Spark SQL, Spark Streaming, MLlib, GraphX)
• A far simpler interface than MapReduce, easy to learn (Scala, REPL)
• 5x to 50x faster than MapReduce depending on the job (in-memory data)
• Compatible with Hadoop storage (HDFS, HBase, S3, ...)
3. Why do we need it?
• MapReduce and Hive (the incumbent technologies)
• Very powerful, but inefficient as jobs grow complex (intermediate results are written to HDFS at every step)
• The API is complex, and a pipeline built by chaining several MR jobs is hard to maintain
4. Spark Key Concepts
• RDD (Resilient Distributed Datasets)
‣ A list shared across the whole cluster, held in memory (spilled to disk when memory runs short)
‣ Supports a wide range of operations: map, reduce, count, filter, join, and more
‣ Operations are queued up and computed lazily when a result is requested (see the sketch after this slide)
• Scala
‣ A very good language for data analysis
‣ Powerful expressiveness, interoperable with Java
‣ Interactive shell (REPL)
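A small sketch of that laziness in Spark's Scala API (the log path is hypothetical): the filter and map below only record lineage; nothing runs until the count action, and cached results are reused afterwards.

// Transformations only build the plan; no work happens yet.
val lines  = sc.textFile("hdfs:///logs/access.log")
val errors = lines.filter(_ contains "ERROR").map(_.split("\t")(0))

errors.cache()           // keep in memory, spilling to disk if it doesn't fit
val n = errors.count()   // the first action runs the whole chain
errors.take(5)           // reuses the cached data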
5. Spark is good
• Big jobs that needed a Hadoop cluster of dozens of nodes can be handled by a cluster of fewer than 10
• Jobs that needed a cluster can run on one or two machines
• Jobs that took tens of minutes finish in a minute
• The tedious cycle of writing MR job code, packaging it, and submitting it becomes a single line of code in the shell
• Easy to learn, even for newcomers
9. Getting Download Data
case class CloudFrontPcVerChart(date: String, country: String, ip: String, http_method: String, ua: String)

val cloudFrontPcVerLogs = "s3n://assets-between-pc-logs/*2014-10-*"

// CloudFront access logs are tab-separated
val cloudFrontPcVerDownloadLogs = sc.textFile(cloudFrontPcVerLogs)
  .filter(_ contains "/downloads/setup.exe")
  .map(x => x.split("\t"))
cloudFrontPcVerDownloadLogs.first

val cloudFrontPcVerDownloadChart = cloudFrontPcVerDownloadLogs
  .map(arr => CloudFrontPcVerChart(arr(0), IP2C.get(arr(4)), arr(4), arr(5), arr(10)))
cloudFrontPcVerDownloadChart.registerAsTable("pc_ver_download")
10. Querying Data
select country, count(1) value
from pc_ver_download
group by country
order by value desc
limit 10
Simple enough!
13. Zeppelin
• A web-based notebook for Apache Spark (http://zeppelin-project.org)
• Open source (https://github.com/NFLabs/zeppelin)
14. Zeppelin
• An early-stage project (about 50 GitHub stars)
• A project likely to become hugely popular within a year or two
• A welcoming project that credits you as a contributor even for a 10-line commit
• Easy to install; on launch it starts Spark internally (connecting to an external cluster is also possible)
20. Combined with Spark SQL, strong potential as a visualization tool as well
Screenshots stand in for the live demo, which was hard to embed in Keynote
21. Combined with Spark SQL, strong potential as a visualization tool as well
Screenshots stand in for the live demo, which was hard to embed in Keynote
22. Combined with Spark SQL, strong potential as a visualization tool as well
A dashboard built in moments with a simple SQL query
Screenshots stand in for the live demo, which was hard to embed in Keynote
23. Combined with Spark SQL, strong potential as a visualization tool as well
Adjusting position, width, and more
Screenshots stand in for the live demo, which was hard to embed in Keynote
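Roughly what those screenshots show: in Zeppelin, a %sql paragraph over the table registered earlier returns a result set that can be flipped between table and chart views. A sketch of that workflow (not the deck's exact paragraph):

%sql
select country, count(1) as value
from pc_ver_download
group by country
order by value desc
limit 10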
24. Zeppelin
• Recommended for anyone who wants an easy way to start analyzing data
• Recommended for anyone who wants to poke around and analyze data nimbly
• Recommended for anyone who wants to build a dashboard quickly
• Recommended for anyone who wants to contribute to a hot open-source project
• If you are completely new to Spark, try the Spark shell first (at least until auto-completion in the Zeppelin code editor improves)