Combining the most requested and technically deep material from his Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark ecosystem by exploring the following:

1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll

* Demos *

This talk features many audience-interactive demos, as well as code-level deep dives into many of the projects listed above.

All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki

In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/

* Speaker Bio *

Chris Fregly is a Principal Data Solutions Engineer at the newly formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, the Organizer of the global Advanced Apache Spark Meetup, and the author of the upcoming book Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix. When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.