Exploring GitHub Data with Apache Drill on ARM64

  1. 1. Exploring GitHub data with Apache Drill on Arm64 Ganesh Raju Naresh Bhat
  2. 2. Who Are We Anyway?
  3. 3. What is Linaro: Leading collaboration in the ARM ecosystem
  4. 4. Apache Drill: open source distributed SQL query engine for non-relational datastores
     - JSON document model
     - Columnar
     Key advantages:
     - Columnar
     - Schema on the fly
     - Integrates with any non-relational datastore
     - Elastic scalability
     - Data can be treated like SQL tables
     - SQL-like query syntax
     - No overhead (creating and maintaining schemas, ETL process, etc.)
     - Vectorization (SIMD instructions)
  5. 5. Apache Drill on Arm64 Server
  6. 6. Test environment - SW basic configuration
     - Architecture: Gigabyte Marvell® ThunderX2® "Saber" 3 node cluster
     - OS platform: Debian GNU/Linux 9.9 (stretch)
     - Linux kernel version: Debian 4.16.13.linaro.290-1
     - GCC version: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
     - GlibC version: Debian GLIBC 2.24-11+deb9u4
     - Java version: openjdk version "1.8.0_191"; OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_191-b12); OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.191-b12, mixed mode)
     - Hadoop version: Hadoop 2.8.5; Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8; compiled by jdu on 2018-09-10T03:32Z; compiled with protoc 2.5.0
     Using upstream release packages from apache.org. Running on a commercially available Arm server based on Marvell ThunderX2.
  7. 7. Test environment - SW basic configuration
     - Zookeeper and libzookeeper-java version: 3.4.9-3+deb9u2
     - Apache Drill version: v1.16.0
     - Jupyter Notebook version: jupyter core 4.5.0, jupyter-notebook 6.0.1, qtconsole 4.5.4, ipython 7.7.0, ipykernel 5.1.2, jupyter client 5.3.1, jupyter lab 1.0.9, nbconvert 5.6.0, ipywidgets 7.5.1, nbformat 4.4.0, traitlets 4.3.2
     - Dataset: 3 TB+ GitHub activity dataset containing a full snapshot of more than 2.8 million open source GitHub repositories, including more than 145 million unique commits
     This demo can be replicated using upstream release packages and an open source data set.
  8. 8. Select * from drillbits;
  9. 9. Select files in dfs;
  10. 10. Top projects this year in GitHub Need to paste Apache Drill query snapshot
  11. 11. Top contributors by year Need to paste Apache Drill query snapshot
  12. 12. Top contributors to Linux by year Need to paste Apache Drill query snapshot
  13. 13. Top contributors to Big Data (Hadoop, Spark, HBase, Hive, Drill, etc.) Need to paste Apache Drill query snapshot
  14. 14. Contributors by Country SELECT * FROM dfs.`/usersummary/*.json` limit 20
  15. 15. Language Popularity Score SELECT * FROM dfs.`/usersummary/*.json` limit 20
  16. 16. Top Python repositories by their commits count SELECT * FROM dfs.`/usersummary/*.json` limit 20
  17. 17. Top Apache Projects by contribution Need to paste Apache Drill query snapshot
  18. 18. Who Are We Anyway? We are Linaro: Leading collaboration in the Arm ecosystem
  19. 19. Linaro: Open Source. Delivering high value collaboration
      - Top 5 company contributor in Linux kernel
      - Contributor to >70 open source projects; many maintained by Linaro engineers
      # | Company (4.8-4.13) | Changesets | %
      1 | Intel | 10,833 | 13.1%
      2 | Red Hat | 5,965 | 7.2%
      3 | Linaro | 4,636 | 5.6%
      Source: Linux Kernel Development Report, Linux Foundation
      Selected projects Linaro contributes to
  20. 20. Linaro: BigData Objective ● Ensure that Arm is a first class platform for Hadoop and Spark. ● Profile Hadoop and Spark for real world workloads on 64-bit Arm server systems. ● Ensure that OpenJDK is running optimally against Hadoop and Spark workloads.
  21. 21. ❏ Founded in November 1990 ❏ Designs the RISC processor cores ❏ Licenses Arm core designs to partners who fabricate and sell to their customers
  22. 22. Arm Ecosystem momentum continues to accelerate www.arm.com Workloads Networking Virtualization & Containers Language & Library Operating system
  23. 23. Marvell
      - Company founded: 1995
      - FY19 revenue: $2.9B
      - Employees: 6,000+
      - Located in: Santa Clara, CA
      - R&D centers: US, Israel, India, Germany, China
      - Patents worldwide: 10,000+
      © 2019 Marvell Confidential, All Rights Reserved.
  24. 24. ThunderX2: Second Generation High-End Armv8-A Server SoC
      • Up to 32 custom Armv8.1 cores, up to 2.5GHz
      • Full Out-of-Order, 1, 2, 4 threads per core
      • 1S and 2S Configuration
      • Up to 8 DDR4-2667 Memory Controllers, 1 & 2 DPC
      • Up to 56 lanes of PCIe Gen3, 14 PCIe controllers
  25. 25. Marvell powers the world’s fastest Arm-based supercomputer. Driven by 145,152 (5,184 CPUs x 28 cores) ThunderX2 cores. Securing U.S. nuclear arsenal.
  26. 26. Marvell-University of Michigan Partnership. Built on Cavium/Marvell-Michigan relationship. Deploy ThunderX for Big Data: ● 4800 Cores ● 25 TB Memory ● 40 & 100 Gbps networking ● 3 PB Hadoop File System. Accelerating the software ecosystem for data science for Arm. Directly consuming Linaro Big Data software builds. We bring an advanced user base in the data science domain.
  27. 27. Questions? Contact us: Ganesh Raju (ganesh.raju@linaro.org), Naresh Bhat (nareshb@marvell.com, naresh.bhat@linaro.org). Blog post: https://nbhatlinaro.blogspot.com/2019/04/apache-drill-on-arm64.html Thanks to the Linaro team: Yuqi Gu, Jun He, Guodong Xu. Inspiration from Felipe Hoffa’s talks on Google BigQuery https://s3.amazonaws.com/connect.linaro.org/bkk19/presentations/bkk19-300k1.pdf
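
The queries on slides 8, 9 and 14-16 could be run from a Jupyter notebook against a Drill cluster. The following is a minimal sketch, not the presenters' actual notebook code. It assumes Drill's REST API is reachable at `http://localhost:8047/query.json` (8047 is Drill's default web port; the host is a placeholder). It also assumes that the slide titles "Select * from drillbits" and "Select files in dfs" correspond to the `sys.drillbits` system table and the `SHOW FILES` command. The helper names `build_request` and `run_query` are hypothetical.

```python
import json
import urllib.request

# Assumption: a Drillbit's REST endpoint is reachable at this address.
DRILL_URL = "http://localhost:8047/query.json"

# Queries matching slides 8, 9 and 14-16 (the first two are assumed
# readings of the slide titles, not text taken from the deck).
QUERIES = {
    "drillbits": "SELECT * FROM sys.drillbits",
    "files": "SHOW FILES IN dfs",
    "usersummary": "SELECT * FROM dfs.`/usersummary/*.json` LIMIT 20",
}


def build_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows (requires a running Drill cluster)."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp).get("rows", [])
```

With a cluster running, `run_query(QUERIES["drillbits"])` would return one row per Drillbit, which is a quick way to confirm that all three nodes have joined before running the larger GitHub queries.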
