Apache Spark v3 is a new milestone for the Big Data framework. In this session, you will (re)discover what Spark is, learn about the new features in its third major version, and go through a complete end-to-end project.
I like to call Spark an Analytics Operating System: it offers far more than just a framework or a library, and I will explain why. Spark v3 is its latest major evolution; released in mid-June 2020, it adds impressive new features. After looking at them from a high level, I will detail a few of my favorites.
Finally, as we all like code (well, at least I do), I will demonstrate a complete data & AI pipeline looking at Covid-19 data.
Key takeaways: Spark as an Analytics OS, Spark v3 highlights, building data/AI pipelines/models with Spark.
Audience: software engineers, data engineers, architects, data scientists.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
A Developer’s View into Spark's Memory Model with Wenchen Fan (Databricks)
As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of JVM object model and garbage collection.
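The on-heap/off-heap distinction the abstract refers to can be illustrated in plain Java with NIO buffers: an on-heap buffer is a byte[]-backed object the garbage collector tracks, while a direct (off-heap) buffer lives in native memory the GC never scans. A generic JVM sketch (class name and sizes are mine, not Spark's internal memory manager):

```java
import java.nio.ByteBuffer;

public class OffHeapSketch {
  // Writes a long at offset 0 and reads it back.
  static long roundTrip(ByteBuffer buf, long value) {
    buf.putLong(0, value);
    return buf.getLong(0);
  }

  public static void main(String[] args) {
    // On-heap: backed by a byte[] that the garbage collector manages and may move.
    ByteBuffer onHeap = ByteBuffer.allocate(64);
    // Off-heap: native memory outside the GC heap; no per-object header,
    // and the GC never scans this data.
    ByteBuffer offHeap = ByteBuffer.allocateDirect(64);

    System.out.println(roundTrip(onHeap, 42L));   // 42
    System.out.println(roundTrip(offHeap, 42L));  // 42
    System.out.println(offHeap.isDirect());       // true
  }
}
```

Tungsten uses similar native allocation (via Unsafe) to lay out records compactly in binary form, which is how it sidesteps JVM object-model and GC overheads.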
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal... (thelabdude)
My presentation focuses on how we implemented Solr 4 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 620M documents and is growing by 3 to 4 million documents per day. My presentation will include details about: 1) Designing a Solr Cloud cluster for scalability and high availability using sharding and replication with Zookeeper, 2) Operational concerns like how to handle a failed node and monitoring, 3) How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput, 4) Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4 is scalable, stable, and production-ready.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Cross Datacenter Replication, aka CDCR, has been a long-requested feature in Apache Solr. In this talk, we will discuss CDCR as released in Apache Solr 6.0 and beyond to understand its use cases, limitations, setup, and performance. We will also take a quick look at future enhancements that can further simplify and scale this feature.
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch (MapR Technologies)
In this talk, we will provide an overview of Elasticsearch for Apache Hadoop (ES-Hadoop), which includes integrations between the various Hadoop libraries, whether batch (Map/Reduce, Pig, Hive) or stream oriented (such as Apache Spark). We will also cover the YARN support and the HDFS snapshot/restore plugin available as part of ES-Hadoop. We will talk about the upcoming ES-Hadoop 2.1 GA release and near-term roadmap.
Owning time series with team apache - Strata San Jose 2015 (Patrick McFadin)
Break out your laptops: this hands-on tutorial is geared toward understanding the basics of how Apache Cassandra stores and accesses time series data. We'll start with an overview of how Cassandra works and why it can be a perfect fit for time series, then add in Apache Spark as the perfect analytics companion. There will be coding as part of the hands-on tutorial; the goal is to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover building an end-to-end data pipeline to ingest, process, and store high-speed time series data.
DataSource V2 and Cassandra – A Whole New World (Databricks)
Data Source V2 has arrived for the Spark Cassandra Connector, but what does this mean for you? Speed, flexibility, and usability improvements abound, and we'll walk you through some of the biggest highlights and how you can take advantage of them today.
An overview of building and serving Lucene indexes on a Hadoop cluster with Solr for text and parametric searching, as presented at Cleveland Hadoop User Group on 13 January 2014.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
Don’t Forget About Your Past - Optimizing Apache Druid Performance With Neil Buesing | Current 2022 (HostedbyConfluent)
Businesses need to react to results immediately; to achieve this, real-time processing is becoming a requirement in many analytic verticals. But sometimes, the move from batch to real-time can leave you in a pinch. How do you handle and correct mistakes in your data? How do you migrate a new system to real-time along with historical data?
We’ll start with how to run Apache Druid locally in a containerized development environment. While real-time events stream from Kafka into Druid, an S3-compliant store captures messages via Kafka Connect for historical processing. We’ll then explore the performance implications when the real-time stream of events contains historical data, and the techniques that prevent those issues, leaving a high-performance analytics platform that supports both real-time and historical processing.
You’ll leave with the tools to do real-time analytic processing and historical batch processing from a single source of truth. Your Druid cluster will have better rollups (pre-computed aggregates) and fewer segments, which reduces cost and improves query performance.
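Rollup is simply pre-aggregation at ingestion time: raw events are grouped by a truncated timestamp plus their dimensions, and the metrics are summed, so queries scan far fewer rows. A minimal plain-Java sketch of the idea (class, field names, and the hour granularity are mine; Druid performs this natively at ingestion):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RollupSketch {
  static final long HOUR_MS = 3_600_000L;

  // Truncate each event's timestamp to the hour, key by (hour, page),
  // and sum the metric: ingestion-time rollup in miniature.
  static Map<String, Long> rollup(long[] tsMillis, String[] page, long[] views) {
    Map<String, Long> rolled = new LinkedHashMap<>();
    for (int i = 0; i < tsMillis.length; i++) {
      String key = (tsMillis[i] / HOUR_MS) * HOUR_MS + "|" + page[i];
      rolled.merge(key, views[i], Long::sum);
    }
    return rolled;
  }

  public static void main(String[] args) {
    long[] ts = {0L, 10_000L, HOUR_MS + 5L};      // two events in hour 0, one in hour 1
    String[] page = {"home", "home", "home"};
    long[] views = {3L, 4L, 5L};
    // Three raw rows roll up to two aggregated rows.
    System.out.println(rollup(ts, page, views));  // {0|home=7, 3600000|home=5}
  }
}
```

Fewer stored rows per segment is exactly why better rollup reduces both storage cost and query latency.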
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project (K. Mohamed Faizal)
Infrastructure as a Service (IaaS) offers features that can be leveraged to host a SharePoint 2013 farm. Learn how to set it up, things to consider when you configure VPN, storage, and cloud services, and how to set up load-balanced endpoints. The speaker will share his real-world experience and tips and tricks.
* Use cases of MySQL, as well as edge cases of MySQL topologies, using real-life examples and "war" stories
* How scalability and proxy wars make MySQL topologies more robust for webscale shops
* Open-source tools, utilities, and the surrounding MySQL ecosystem
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on emerging technologies/frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, with basic concepts of columnar storage and Dremel.
Description of some of the elements that go into creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications.
Drilling Cyber Security Data With Apache Drill (Charles Givre)
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
(DAT402) Amazon RDS PostgreSQL: Lessons Learned & New Features (Amazon Web Services)
Learn the specifics of Amazon RDS for PostgreSQL’s capabilities and the extensions that make it powerful. This session begins with a brief overview of the RDS PostgreSQL service and how it provides high availability and durability, then dives deep into the new features we have released since re:Invent 2014, including major version upgrades and newly added PostgreSQL extensions. During the session, we will also discuss lessons learned running a large fleet of PostgreSQL instances, including specific recommendations. In addition, we will present benchmarking results looking at differences between the 9.3, 9.4, and 9.5 releases.
SQL on Hadoop benchmarks using TPC-DS query set (Kognitio)
Sharon Kirkham, VP Analytics & Consulting at Kognitio, ran the TPC-DS query set using Impala, SparkSQL, and Kognitio to test speed, reliability, and concurrency across different SQL-on-Hadoop solutions. Standard Hive was originally investigated as part of this benchmark, but limited SQL support and poor single-thread performance meant it was removed.
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu (NETWAYS)
Monica is the co-creator of Elastic Beats. Before inventing Beats, she worked as a core developer at IPTEGO, a Berlin start-up offering a complete monitoring and troubleshooting solution for VoIP networks. The product was sold worldwide and is currently used by major telecommunications companies.
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu (NETWAYS)
Beats are a friendly army of lightweight agents that, installed on your servers, capture operational data and ship it to Elasticsearch for analysis.
They collect your servers' log data and gather statistics on CPU, disk, and memory usage. Through regular polling they collect metrics from external systems such as MySQL, Docker, and Zookeeper, and they can visualize communication between servers by sniffing the corresponding network connections.
This talk explains how you can combine Beats with Elasticsearch and Kibana into a complete open source monitoring solution, and how they help you monitor and troubleshoot your sprawling infrastructure.
Streaming ETL - from RDBMS to Dashboard with KSQL (Bjoern Rost)
Apache Kafka is a massively scalable message queue that is being used at more and more places connecting more and more data sources. This presentation will introduce Kafka from the perspective of a mere mortal DBA and share the experience of (and challenges with) getting events from the database to Kafka using Kafka connect including poor-man’s CDC using flashback query and traditional logical replication tools. To demonstrate how and why this is a good idea, we will build an end-to-end data processing pipeline. We will discuss how to turn changes in database state into events and stream them into Apache Kafka. We will explore the basic concepts of streaming transformations using windows and KSQL before ingesting the transformed stream in a dashboard application.
An introduction to data engineering & data science using Apache Spark and Java.
Get Spark in Action 2e, at http://jgp.ai/sia.
In this presentation, I start by loading a few CSV files into Spark (ingestion) and displaying them with the help of this new tool I built, dṛṣṭi.
As you would expect, I clean the data, join it, transform it, and continue to visualize it through dṛṣṭi.
I use Delta Lake to create a cache for my data, explain what imputation is, and show how I can use imputation on my datasets to fill in the missing datapoints.
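Imputation replaces missing datapoints with values inferred from the observed ones. As a minimal illustration of the idea, a plain-Java mean-imputation sketch (class and method names are mine; the talk itself does this on Spark dataframes, not raw arrays):

```java
import java.util.Arrays;

public class ImputationSketch {
  // Replaces NaN entries with the mean of the observed values.
  static double[] imputeMean(double[] col) {
    double sum = 0;
    int n = 0;
    for (double v : col) {
      if (!Double.isNaN(v)) { sum += v; n++; }
    }
    double mean = n == 0 ? 0 : sum / n;
    double[] out = col.clone();
    for (int i = 0; i < out.length; i++) {
      if (Double.isNaN(out[i])) out[i] = mean;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] cases = {10, Double.NaN, 30};                   // one missing datapoint
    System.out.println(Arrays.toString(imputeMean(cases)));  // [10.0, 20.0, 30.0]
  }
}
```

Other strategies (median, interpolation between neighbors) follow the same shape: compute a fill value from what you have, then patch the gaps.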
I then use Spark to run simple linear regressions to predict/forecast data.
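A simple linear regression fits y = a + b·x by least squares and extrapolates the line to forecast future values. A self-contained sketch of just the math (names and data are mine; the pipeline in the talk uses Spark ML rather than hand-rolled code):

```java
public class TrendSketch {
  // Ordinary least squares for y = a + b*x; returns {intercept, slope}.
  static double[] fit(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i]; sy += y[i];
      sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] {intercept, slope};
  }

  public static void main(String[] args) {
    // Daily case counts for days 0..3; forecast day 5 by extending the line.
    double[] day = {0, 1, 2, 3};
    double[] cases = {100, 110, 120, 130};
    double[] model = fit(day, cases);
    double forecast = model[0] + model[1] * 5;
    System.out.println(forecast);  // 150.0
  }
}
```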
dṛṣṭi is open source (Apache 2 license) and is available at: https://github.com/jgperrin/ai.jgp.drsti.
All the labs are available at https://github.com/jgperrin/ai.jgp.drsti-spark.
"Big Data made easy with a Spark" is the presentation I gave for Open Source 101 in Columbia, SC, on April 18th, 2019. It is a hands-on tutorial on Apache Spark with Java, walking through 3 different labs.
"Big Data made easy with a Spark" is the presentation I gave for ATO (AllThingsOpen) 2018.
In this hands-on session, you will learn how to do a full Big Data scenario from ingestion to publication. You will see how we can use Java and Apache Spark to ingest data, perform some transformations, save the data. You will then perform a second lab where you will run your very first Machine Learning algorithm!
Those slides were used for NC Tech's lunch and learn on Aug. 22 2018.
In this lunch and learn, hosted by Veracity Solutions, you will learn how Spark can help your business build a pragmatic technology roadmap to AI (Artificial Intelligence), Machine Learning, and Big Data analytics. Apache Spark is a wonderful platform for distributed data processing and analytics, but how is it used by different organizations? How difficult is it to onboard a team? What technology do they need to master beforehand? Do they have to master Scala, or can they simply use their Java skills? You will find answers to those questions, get a realistic perspective on the platform, and see code (because we are all a bit geeky, right?).
Full link to the event: https://www.nctech.org/events/event/2018/lunch-and-learn-august22.html.
Spark Summit Europe Wrap Up and TASM State of the Community (Jean-Georges Perrin)
On 12/12, we held our Spark meetup, called Winter 3x30, at IBM. These are the slides I used both to introduce the state of our community, TASM (Triangle Apache Spark Meetup), and to wrap up Spark Summit Europe.
I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we are going through the basics of Spark as well as 2 examples: a basic ingestion and an analytics example based on joins & group by. Follow me @jgperrin.
As I went to Spark Summit in San Francisco, early June, I wanted to share key takeaways from the conference with my local friends of the Triangle Apache Spark Meetup.
Used for teaching HTML to middle school children (6th, 7th, and 8th graders) in a "game way" with some immediate gratification. Feedback much appreciated: jgp@jgp.net.
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe... (Jean-Georges Perrin)
On July 9th, 2015, 2CRSI announced its latest storage system, the 2U24NVMe, which features 24 NVMe SSD drives that are individually 10 to 12 times faster than SATA/SAS SSDs. Jean Georges Perrin, 2CRSI Corporation's COO, introduces you to this wonderful solution... and more. This presentation was first given on July 13th, 2015 at the ISC HPC conference in Frankfurt, Germany.
A strategic vision for using (Open)Data in the enterprise (Jean-Georges Perrin)
A vision for an OpenData usage strategy, with a definition, the ecosystem, the obstacles, and possible solutions to remove those obstacles.
A proposal to create a consortium of private and public actors.
Presented by Jean Georges Perrin, GreenIvory (http://greenivory.fr/), as part of a Rhenatic workshop (http://www.rhenatic.eu/).
Presentation done for the AdriaUG on May 23rd 2012 in Zagreb, Croatia.
This is an updated version of the presentation done in 2010 at the IIUG conference in Overland Park, KS, USA.
Version of the presentation used for the DCF (Dirigeants Commerciaux de France) on January 9th, 2012, near Colmar, Alsace.
Adapted from the presentation given at the CCI Alsace in Strasbourg in October 2011.
A talk given at the CCI de Strasbourg on October 11th, 2011, illustrating how to better use your website to sell more.
The examples are projects built with GreenIvory's technologies.
Discover GreenIvory:
http://greenivory.fr/
Discover our success stories:
http://greenivory.fr/success-stories.html
Discovering new web trends (Mulhouse Edition) (Jean-Georges Perrin)
Talk by Jean-Georges Perrin (GreenIvory) at the CCI SAM (Sud Alsace - Mulhouse), organized by Martine Zussy.
Topics covered: the social web, search engine optimization (SEO), SMO...
MashupXFeed and editorial strategy - Workshop Activis - GreenIvory (Jean-Georges Perrin)
Presentation by Jean-Georges Perrin (CEO of GreenIvory) on putting an editorial strategy in place, with other examples of MashupXFeed usage. Details on content farms.
MashupXFeed and SEO - Workshop Activis - GreenIvory
Presentation by Xavier-Noël Cullmann (technical sales, Activis) on the benefits of MashupXFeed when used for SEO. Focus on duplicate content.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and can also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance; the final ranks of chain nodes are easy to calculate, so this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
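The "skip converged vertices" idea above can be sketched in a few lines: once a vertex's rank stops changing beyond a tolerance, it is frozen and excluded from further updates. A plain-Java power-iteration sketch (simplified: no dangling-node, in-identical, chain, or SCC handling, all of which STICD adds on top):

```java
public class PageRankSketch {
  // Pull-based power iteration: each vertex recomputes its rank from its
  // in-neighbors; vertices whose rank has already converged are skipped.
  static double[] pageRank(int[][] inLinks, int[] outDegree,
                           double damping, double tol, int maxIter) {
    int n = inLinks.length;
    double[] rank = new double[n];
    java.util.Arrays.fill(rank, 1.0 / n);
    boolean[] converged = new boolean[n];
    for (int iter = 0; iter < maxIter; iter++) {
      double[] next = rank.clone();
      boolean allDone = true;
      for (int v = 0; v < n; v++) {
        if (converged[v]) continue;                // frozen: no work this iteration
        double sum = 0;
        for (int u : inLinks[v]) sum += rank[u] / outDegree[u];
        next[v] = (1 - damping) / n + damping * sum;
        if (Math.abs(next[v] - rank[v]) < tol) converged[v] = true;
        else allDone = false;
      }
      rank = next;
      if (allDone) break;
    }
    return rank;
  }

  public static void main(String[] args) {
    // Tiny 3-vertex cycle 0 -> 1 -> 2 -> 0: by symmetry each rank is 1/3.
    int[][] in = {{2}, {0}, {1}};
    int[] outDeg = {1, 1, 1};
    double[] r = pageRank(in, outDeg, 0.85, 1e-10, 100);
    System.out.println(Math.abs(r[0] - 1.0 / 3) < 1e-6);  // true
  }
}
```

The sketch shows only the convergence-skipping ingredient; the other techniques (chain short-circuiting, topological SCC scheduling) restructure the graph before or around this loop.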
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables the ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
11. Data Engineer vs. Data Scientist
Data Engineer:
• Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
Data Scientist:
• Clean, massage, and organize data.
• Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Sources: adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
14. Python rules in notebooks
Sources:
• Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc
• Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
15. A few more figures
Who does not like performance figures?
• Databricks:
• Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2)
• >90% of all Spark API calls are Spark SQL, regardless of the language used
• Community:
• Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB benchmark
Sources:
• Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc
• Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
• Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
23. Always a soup
SQL
• Finally a reference guide: http://jgp.ai/sparksql
• EXPLAIN can be FORMATTED
• Proleptic Gregorian calendar, based on Java 8
• Overflow check
• ANSI compatibility through a configuration flag
24. Ingestion
Who needs a push down?
• Already available in databases
• Allows you to filter what you ingest, before you ingest it
• Equivalent to, but easier than, ingesting and then filtering
25.
String sqlQuery =
    "select actor.first_name, actor.last_name, film.title, "
        + "film.description "
        + "from actor, film_actor, film "
        + "where actor.actor_id = film_actor.actor_id "
        + "and film_actor.film_id = film.film_id";
Dataset<Row> df = spark.read().jdbc(
    "jdbc:mysql://localhost:3306/sakila",
    "(" + sqlQuery + ") actor_film_alias",
    props);
Will only ingest the result of the MySQL query.
/jgperrin/net.jgp.books.spark.ch08 - Chapter 8, Lab #310
26. +---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language?
An i...| 12/28/2016|http://amzn.to/2vBxOe1|
| 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
…
Dataset<Row> df = spark.read().format("csv")
…
.load("data/books.csv")
.filter("authorId = 1”);
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
Will only ingest books where authorId is 1
/jgperrin/net.jgp.books.spark.ch07
Chapter 7
Lab #201
27. Migration tips
Yes, they are needed
• Compilation will detect some issues (new exceptions in Structured Streaming)
• Runtime will throw you off:
• Parsing dates
• Data sources (v2 on the way)
• Reference: https://spark.apache.org/docs/latest/migration-guide.html
28.
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .getOrCreate();
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");

or:

SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .getOrCreate();

/jgperrin/net.jgp.books.spark.ch03 - Chapter 3, Labs #320 and #321
30. The lakehouse is a full ecosystem
Or is it an operating system?
[Diagram: data sources (files, streams, systems, other databases) feed the processing & storage layer (Delta Lake & Delta Engine), which serves business, data science, and data engineering outcomes.]
31. Takeaways
• Apache Spark v3 is a major update: 3,400+ patches
• Foundation for a rich data ecosystem
• Python is increasingly popular and beats Scala
• Cornerstone of the lakehouse concept
34. Credits
• World of Watson by Jean-Georges Perrin, CC BY-SA 4.0
• Digital Garage by Jean-Georges Perrin, CC BY-SA 4.0
• Figs, grapes and rosehips by Marco Verch, Professional Photographer and Speaker, Flickr
• Soup by Valeria Boltneva, from Pexels