The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
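To make the partitioning and predicate-pushdown ideas above concrete, here is a minimal PySpark sketch (column names and paths are illustrative, not from the talk): data is written partitioned by a low-cardinality column, and a filter on that column lets Spark prune directories and row groups on read.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Hypothetical events data; `event_date` and `country` are illustrative column names.
events = spark.createDataFrame(
    [("2021-01-01", "NL", 42), ("2021-01-02", "US", 7)],
    ["event_date", "country", "clicks"],
)

# Partition on disk by a low-cardinality column so readers can skip whole directories.
events.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_parquet")

# The filter on the partition column is pushed down; only matching files and row groups are read.
jan_first = (
    spark.read.parquet("/tmp/events_parquet")
         .filter(F.col("event_date") == "2021-01-01")
)
jan_first.explain()  # the physical plan lists PartitionFilters / PushedFilters
```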
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns as it is written out (a one-time cost), so that successive reads of the data are more performant for downstream jobs when the SQL operators can make use of this property. Bucketing can enable faster joins (e.g. a single-stage sort-merge join), allow FILTER operations to short-circuit when the file is pre-sorted on the column in a filter predicate, and support quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, such as loading cumulative tables with daily deltas, and the characteristics that help identify suitable candidate jobs that can benefit from bucketing.
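As a hedged sketch of what the write-time cost and read-time benefit look like in Spark (table and column names are made up; bucketBy requires saving as a table):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-demo").getOrCreate()

orders = spark.range(0, 1_000_000).withColumnRenamed("id", "order_id")

# One-time cost at write time: hash into 16 buckets and sort within each bucket.
(orders.write
       .bucketBy(16, "order_id")
       .sortBy("order_id")
       .mode("overwrite")
       .saveAsTable("orders_bucketed"))

# A join on the bucketing column of two identically bucketed tables can avoid the shuffle
# (single-stage sort-merge join); explain() shows whether an Exchange appears in the plan.
left = spark.table("orders_bucketed")
right = spark.table("orders_bucketed")
left.join(right, "order_id").explain()
```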
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino Summit 2021:
An overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
Building Lakehouses on Delta Lake with SQL Analytics Primer (Databricks)
You’ve heard the marketing buzz, maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together? Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
Presto best practices for Cluster admins, data engineers and analysts (Shubham Tagra)
This document provides best practices for using Presto across three categories: cluster admins, data engineers, and end users. For admins, it recommends optimizing JVM size, setting concurrency limits, using spot instances to reduce costs, enabling data caching, and using resource groups for isolation. For data engineers, it suggests best practices for data storage like using columnar formats and statistics. For end users, tips include using deterministic filters, explaining queries, and addressing skew through techniques like broadcast joins.
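To ground a couple of the end-user tips, here is a hedged sketch using the trino Python client (host, catalog, and table names are placeholders, and the same ideas apply to PrestoDB clients): inspect the plan with EXPLAIN before running an expensive query, and filter on a partition column with a literal, deterministic predicate so partitions can be pruned.

```python
from trino.dbapi import connect  # pip install trino

# Placeholder connection details; adjust to your coordinator.
conn = connect(host="presto-coordinator.example.com", port=8080,
               user="analyst", catalog="hive", schema="default")
cur = conn.cursor()

# A literal predicate on the partition column is deterministic and prunable,
# unlike one computed from now() or rand() at query time.
query = "SELECT count(*) FROM events WHERE dt = DATE '2021-01-01'"

# Explain the query first to check the join order, filters, and estimates.
cur.execute("EXPLAIN " + query)
for (line,) in cur.fetchall():
    print(line)

cur.execute(query)
print(cur.fetchone())
```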
Luigi is a workflow management system that allows users to build complex data pipelines. It provides tools to define dependencies between tasks, run workflows on Hadoop, and visualize data flows. The speaker describes how they developed Luigi at Spotify to manage thousands of Hadoop jobs run daily for music recommendations and other applications. Key features of Luigi include defining Python tasks, easy command line execution, automatic dependency resolution, and failure recovery through atomic file operations. The speaker demonstrates how Luigi can run multi-step workflows on the command line, including a music recommendation example involving feature extraction, model training, and evaluation.
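A minimal Luigi sketch of the ideas mentioned (Python task definitions, automatic dependency resolution, atomic file outputs); the task and file names are invented for illustration.

```python
import luigi


class ExtractFeatures(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # The existence of this target tells Luigi the task already ran;
        # writing through the target gives atomic, resumable outputs.
        return luigi.LocalTarget(f"features_{self.date}.tsv")

    def run(self):
        with self.output().open("w") as out:
            out.write("user\tplays\n")


class TrainModel(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Declared dependencies are resolved and scheduled automatically.
        return ExtractFeatures(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"model_{self.date}.txt")

    def run(self):
        with self.input().open() as feats, self.output().open("w") as out:
            out.write(f"trained on {len(feats.readlines())} rows\n")


if __name__ == "__main__":
    # Run from the command line, e.g.: python pipeline.py TrainModel --date 2014-01-01 --local-scheduler
    luigi.run()
```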
Getting Started with Databricks SQL Analytics (Databricks)
It has long been said that business intelligence needs a relational warehouse, but that view is changing. With the Lakehouse architecture being shouted from the rooftops, Databricks have released SQL Analytics, an alternative workspace for SQL-savvy users to interact with an analytics-tuned cluster. But how does it work? Where do you start? What does a typical Data Analyst’s user journey look like with the tool?
This session will introduce the new workspace and walk through the various key features – how you set up a SQL Endpoint, the query workspace, creating rich dashboards and connecting up BI tools such as Microsoft Power BI.
If you’re truly trying to create a Lakehouse experience that satisfies your SQL-loving Data Analysts, this is a tool you’ll need to be familiar with and include in your design patterns, and this session will set you on the right path.
This document discusses using ClickHouse for experimentation and metrics at Spotify. It describes how Spotify built an experimentation platform using ClickHouse to provide teams interactive queries on granular metrics data with low latency. Key aspects include ingesting data from Google Cloud Storage to ClickHouse daily, defining metrics through a centralized catalog, and visualizing metrics and running queries using Superset connected to ClickHouse. The platform aims to reduce load on notebooks and BigQuery by serving common queries directly from ClickHouse.
Introduction SQL Analytics on Lakehouse Architecture (Databricks)
This document provides an introduction and overview of SQL Analytics on Lakehouse Architecture. It discusses the instructor Doug Bateman's background and experience. The course goals are outlined as describing key features of a data Lakehouse, explaining how Delta Lake enables a Lakehouse architecture, and defining features of the Databricks SQL Analytics user interface. The course agenda is then presented, covering topics on Lakehouse Architecture, Delta Lake, and a Databricks SQL Analytics demo. Background is also provided on Lakehouse architecture, how it combines the benefits of data warehouses and data lakes, and its key features.
Python Data Wrangling: Preparing for the Future (Wes McKinney)
The document is a slide deck for a presentation on Python data wrangling and the future of the pandas project. It discusses the growth of the Python data science community and key projects like NumPy, pandas, and scikit-learn that have contributed to pandas' popularity. It outlines some issues with the current pandas codebase and proposes a new C++-based core called libpandas for pandas 2.0 to improve performance and interoperability. Benchmark results show serialization formats like Arrow and Feather outperforming pickle and CSV for transferring data.
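A small, hedged illustration of the serialization comparison mentioned in the summary (Feather and Parquet go through pyarrow, so `pip install pyarrow` is assumed; absolute timings depend entirely on your data and machine).

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("to_csv", lambda: df.to_csv("df.csv", index=False))
timed("to_pickle", lambda: df.to_pickle("df.pkl"))
timed("to_feather", lambda: df.to_feather("df.feather"))   # Arrow-based columnar file
timed("to_parquet", lambda: df.to_parquet("df.parquet"))   # columnar, compressed
timed("read_feather", lambda: pd.read_feather("df.feather"))
```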
The Summer 2016 release of Informatica Cloud is packed with many new platform features, including:
- Cloud Data Integration Hub that supports publish and subscribe integration patterns that automate and streamline integration across cloud and on-premise sources
- Innovative features like stateful time sensitive variables, and advanced data transformations like unions and sequences
- Intelligent and dynamic data masking of sensitive data to save development and QA time.
- Cloud B2B Gateway is the leading data exchange platform for enterprises and its partners and customers, providing end-to-end data monitoring capabilities and support for the highest level of data quality.
- Enhancements to native connectors for popular cloud applications like Workday, SAP Success Factors, Oracle, SugarCRM, MongoDB, Teradata Cloud, SAP Concur, Salesforce Financial Services Cloud
And much more!
SF Big Analytics 2020-07-28
An anecdotal history of the Data Lake and various popular implementation frameworks: why certain tradeoffs were made to solve problems such as cloud storage, incremental processing, streaming and batch unification, mutable tables, ...
Foursquare uses Luigi to manage their complex data workflows. Luigi allows them to define tasks with dependencies in Python code rather than XML, making the workflows easier to write, test, visualize, and reuse components of. It also avoids wasted time from Cron jobs waiting and helps ensure tasks are only run once through its centralized scheduler. This provides a more robust replacement for both Cron jobs and Oozie workflows at Foursquare.
dbt Python models - GoDataFest by Guillermo Sanchez (GoDataDriven)
Guillermo Sanchez presented on the pros and cons of using Python models in dbt. While Python models allow for more advanced analytics and leveraging the Python ecosystem, they also introduce more complexity in setup and divergent APIs across platforms. Additionally, dbt may not be well-suited for certain use cases like ingesting external data or building full MLOps pipelines. In general, Python models are best for the right analytical use cases, but caution is needed, especially for production environments.
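For context, a dbt Python model is a function that receives the dbt object and a platform session and returns a DataFrame. The sketch below is a minimal example; it assumes an upstream model named `stg_orders` exists, that your adapter supports Python models, and that the returned DataFrame behaves like pandas (on Snowpark or PySpark the column logic would differ).

```python
# models/orders_enriched.py -- minimal dbt Python model sketch.

def model(dbt, session):
    # Same role as {{ config(...) }} in a SQL model.
    dbt.config(materialized="table")

    # dbt.ref() resolves the upstream model and returns a platform DataFrame.
    orders = dbt.ref("stg_orders")

    # Arbitrary Python from here on; assumes a pandas-like DataFrame for illustration.
    orders["order_value_eur"] = orders["order_value_usd"] * 0.92

    return orders
```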
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
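As a minimal sketch of the "convert your existing application" point (paths are placeholders; assumes the delta-spark package is installed):

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Before: df.write.format("parquet").save(path)
# After: the same writer API with format("delta") gains ACID transactions and a transaction log.
df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Existing Parquet data can also be converted in place (placeholder path).
DeltaTable.convertToDelta(spark, "parquet.`/tmp/existing_parquet_table`")
```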
The document compares the query optimizers in MySQL and PostgreSQL. It finds that the PostgreSQL query optimizer is more advanced, able to handle analytical loads better than MySQL which is better suited for transactional loads. The document covers how each handles configuration, statistics, metadata, indexing, partitioning, joins, and subqueries. It concludes the PostgreSQL optimizer has very good statistics, supports more join types and indexing capabilities, while MySQL has more limited capabilities and some queries require rewriting for best performance.
Delta from a Data Engineer's Perspective (Databricks)
This document describes the Delta architecture, which unifies batch and streaming data processing. Delta achieves this through a continuous data flow model using structured streaming. It allows data engineers to read consistent data while being written, incrementally read large tables at scale, rollback in case of errors, replay and process historical data along with new data, and handle late arriving data without delays. Delta uses transaction logging, optimistic concurrency, and Spark to scale metadata handling for large tables. This provides a simplified solution to common challenges data engineers face.
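To make the batch-and-streaming unification concrete, a hedged Structured Streaming sketch against a Delta path (placeholder paths; assumes a Delta-enabled SparkSession configured as in the previous sketch):

```python
from pyspark.sql import SparkSession

# Reuses the Delta-enabled session configured as in the previous sketch.
spark = SparkSession.builder.getOrCreate()

path = "/tmp/events_delta"             # placeholder Delta table path
checkpoint = "/tmp/events_checkpoint"  # placeholder checkpoint location

# Batch read: a consistent snapshot, even while other jobs are appending.
batch_df = spark.read.format("delta").load(path)

# Streaming read of the same table: new commits are picked up incrementally.
stream_df = spark.readStream.format("delta").load(path)

query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", checkpoint)
         .outputMode("append")
         .start("/tmp/events_delta_copy"))
```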
Unified Big Data Processing with Apache Spark (QCON 2014) (Databricks)
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
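A tiny PySpark sketch of the data-sharing idea described above: one cached RDD is reused by several parallel operations, including a classic MapReduce-style word count.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sharing-demo").getOrCreate()
sc = spark.sparkContext

# One RDD, cached in memory, reused by several operations without recomputing the source.
lines = sc.parallelize(["INFO start", "ERROR disk full", "ERROR timeout", "INFO done"]).cache()

errors = lines.filter(lambda line: line.startswith("ERROR"))
print("total lines:", lines.count())   # first action materializes and caches the RDD
print("error lines:", errors.count())  # reuses the cached data

# Word frequencies over the same cached RDD (MapReduce expressed as RDD transformations).
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
```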
This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.
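A minimal sketch along those lines (the flight data is made up inline; a real example would load a CSV or Parquet file): the same question asked through the DataFrame API and through Spark SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# DataFrames can be built from local data, files, Hive tables, JDBC sources, and more.
flights = spark.createDataFrame(
    [("AMS", "JFK", 45), ("AMS", "SFO", 0), ("JFK", "SFO", 12)],
    ["origin", "dest", "delay_minutes"],
)

# DataFrame API query ...
(flights.groupBy("origin")
        .agg(F.avg("delay_minutes").alias("avg_delay"))
        .orderBy(F.desc("avg_delay"))
        .show())

# ... and the equivalent Spark SQL query over the same data.
flights.createOrReplaceTempView("flights")
spark.sql("SELECT origin, avg(delay_minutes) AS avg_delay FROM flights GROUP BY origin").show()
```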
This document discusses PostgreSQL statistics and how to use them effectively. It provides an overview of various PostgreSQL statistics sources like views, functions and third-party tools. It then demonstrates how to analyze specific statistics like those for databases, tables, indexes, replication and query activity to identify anomalies, optimize performance and troubleshoot issues.
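A hedged example of pulling two of the statistics views mentioned above with psycopg2 (connection details are placeholders):

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=appdb user=postgres host=localhost")  # placeholder DSN
cur = conn.cursor()

# Table-level activity: heavy sequential scans or many dead tuples hint at missing indexes or bloat.
cur.execute("""
    SELECT relname, seq_scan, idx_scan, n_dead_tup
    FROM pg_stat_user_tables
    ORDER BY seq_scan DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)

# Current activity: long-running queries are a common starting point for troubleshooting.
cur.execute("""
    SELECT pid, state, now() - query_start AS runtime, left(query, 60)
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```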
Radical Speed for SQL Queries on Databricks: Photon Under the Hood (Databricks)
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
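A small, hedged sketch of the salting technique for skewed keys (column names and the skew pattern are invented); the same two-stage idea applies to joins by salting both sides.

```python
import random

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

# Heavily skewed data: most rows share one hot key.
rows = [("hot_key" if random.random() < 0.95 else f"key_{i % 50}", 1) for i in range(100_000)]
df = spark.createDataFrame(rows, ["key", "value"])

SALT_BUCKETS = 16

# Stage 1: a random salt spreads the hot key over many partitions during the shuffle.
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
partial = salted.groupBy("key", "salt").agg(F.sum("value").alias("partial_sum"))

# Stage 2: a much smaller aggregation removes the salt again.
totals = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))
totals.orderBy(F.desc("total")).show(5)
```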
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ... (HostedbyConfluent)
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real-time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real-time key-value store inside Pinot that can sustain half a million QPS across all the financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
Presto is an open-source distributed SQL query engine for interactive analytics. It uses a connector architecture to query data across different data sources and formats in the same query. Presto's query planning and execution involves scanning data sources, optimizing query plans, distributing queries across workers, and aggregating results. Understanding Presto's query plans helps optimize queries and troubleshoot performance issues.
Top 5 Mistakes When Writing Spark Applications (Spark Summit)
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources (see the configuration sketch after this list).
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed by shading dependencies.
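To ground point 1 (and the shuffle-block limit in point 2), here is a hedged sketch of how these settings are usually expressed when building a SparkSession; the numbers are purely illustrative and must be derived from your node sizes and workload.

```python
from pyspark.sql import SparkSession

# Illustrative sizing for, say, 16-core / 64 GB worker nodes, leaving headroom for the OS
# and cluster-manager daemons: a few mid-sized executors rather than one huge or many tiny ones.
spark = (SparkSession.builder
         .appName("executor-sizing-demo")
         .config("spark.executor.cores", "5")            # avoid very wide executors
         .config("spark.executor.memory", "14g")         # heap per executor
         .config("spark.executor.memoryOverhead", "2g")  # off-heap overhead the resource manager accounts for
         .config("spark.executor.instances", "11")       # total executors across the cluster
         .config("spark.sql.shuffle.partitions", "400")  # more partitions keeps shuffle blocks well under 2 GB
         .getOrCreate())
```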
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
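As a hedged illustration of the compaction point, the sketch below uses the kafka-python client to create a compacted topic and produce keyed updates (the broker address and topic name are placeholders):

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BOOTSTRAP = "localhost:9092"  # placeholder broker

# A compacted topic retains at least the latest record per key, which suits
# propagating updatable database records.
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([
    NewTopic(name="user-profiles", num_partitions=3, replication_factor=1,
             topic_configs={"cleanup.policy": "compact"})
])

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)

# Repeated writes with the same key: after compaction only the latest value per key survives.
producer.send("user-profiles", key=b"user-42", value=b'{"plan": "free"}')
producer.send("user-profiles", key=b"user-42", value=b'{"plan": "premium"}')
producer.flush()
```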
Jupyter: A Gateway for Scientific Collaboration and Education (Carol Willing)
This document provides an overview and agenda for an upcoming webinar on Jupyter tools for scientific collaboration and education. It begins with introductions and then covers Jupyter Notebook for interactive and reproducible computing, JupyterHub for hosting and scaling notebooks, JupyterLab as the next evolution of the notebook interface, and next steps for the Jupyter community. Examples are given throughout of different uses of Jupyter in education, research, and industry. The document concludes by inviting participants to get involved in Jupyter through various means and announcing an upcoming Jupyter conference.
The document discusses various ways that new technologies can be used to enhance geography teaching and learning. It provides ideas for using department websites, blogs, and social media like Twitter to share resources, promote the department, and engage with students and parents. It also explores using technologies for professional development, assessment, digital mapping, organizing resources, and creating interactive teaching materials like revision guides, worksheets, and quizzes. Overall, the document outlines how technologies can support online learning, collaboration, and organization across a geography department.
This document discusses constructivism and its implications for promoting active, collaborative, inquiry-based learning in a virtual environment. It provides examples of how interactive tools, virtual manipulatives, online discussions, and collaborative projects can support constructivist learning principles. While constructivist methods may improve higher-order thinking, they do not necessarily boost performance on traditional tests, so a variety of teaching strategies is recommended.
Research Webinar: OERS and Cognitive Science (iNACOL)
This webinar provides practical information on how to use published research findings and make contact with cognitive scientists in order to improve K-12 and university students’ learning from digital online resources, like Khan Academy videos or interactive mathematics exercises. The webinar focuses on how students’ motivation and grades have been increased by helping them believe they can take charge of their learning and become smarter, and how students can be supported in reflective thinking and seeking deep understanding when questions and prompts for students to explain are inserted in videos and interactive exercises.
June 8: Designing for Open Pedagogy with CCCOER (Una Daly)
Please join the Community College Consortium for Open Educational Resources (CCCOER) for a free and open webinar on Designing for Open Pedagogy. Open Pedagogy was first introduced by Lumen Learning co-founder David Wiley, as a way to capture how the use of OER can change educational practices. He relates that using OER in the same way as traditional textbooks is like driving an airplane down the road – it is missing out on what open can provide for student and teacher collaboration, engagement, and learning.
When: June 8, 10am PST / 1pm EST
We will hear from two professors who have not only adopted OER but have redesigned their courses with the principles of open pedagogy. Although reduced cost is what originally attracted them to using OER, involving their students in creating and evaluating OER course materials has significantly increased student engagement and critical thinking and their courses are continually being updated and improved as a result.
Featured Speakers:
• Suzanne Wakim, Biology Faculty Butte College, OER Coordinator
Will share her open course design strategy where students in subsequent semesters build on the work of those before them to create an open textbook and ancillary material. Students discuss and decide on how best to present material in the book, what applications are relevant for each topic, and what materials can help other students learn the course content.
• Mike Elmore, Political Science Faculty, Tacoma Community College
Will share how he has engaged students in collaborative writing of an Introduction to Political Science open textbook. His students report that writing assignments take on new meaning when they realize that other people are going to read their work. Not just repeating what they have read or heard in class, they compare their understanding with their peers and collaborate to present their ideas in the best way possible.
Participant Login Information:
No pre-registration is necessary. Please use the link below on the day of the webinar to login and listen.
http://www.cccconfer.org/GoToMeeting?SeriesID=62446bc7-ca21-4fb3-a56b-7f135cc8cde4
Posted by: Una Daly, Director of Curriculum Design & College Outreach, OEC Consortium, email: unatdaly@oeconsortium.org
Creating authentic classroom scenarios to enhance student learning (debbieholley1)
A keynote presentation to the BU Careers and Teachers conference #BUCATC
This keynote suggests that, with the move to models of school-based teacher training, trainee teachers and NQTs find the CPD on offer unsuitable for their needs. Here I suggest that harnessing their desire to co-create with their peers and develop solutions to real-time issues can be scaffolded and mediated through simple technology such as augmented reality, and I present a model to take the work forward.
Collaborating with students - Reflections on UCC co-creating learning experie... (CONUL Teaching & Learning)
This document summarizes a project where UCC Library collaborated with students to improve its Canvas course and create interactive learning objects. Six students provided feedback and co-created content like H5P objects and scavenger hunts. Their feedback led to improvements like clearer module structure and interactive elements. Students enjoyed contributing and learning new skills. Challenges included promoting the scavenger hunt app and remote coordination. Future plans include continued accessibility work and student involvement.
The document discusses the 7Cs of learning design proposed by Gráinne Conole. The 7Cs include: conceptualize, capture, communicate, collaborate, consider, consolidate, and continue. Conole outlines how new technologies have led to more open, social, and participatory approaches to learning. However, replicating old pedagogies with new tools does not fully leverage their potential. The learning design process emphasizes explicit design methods and sharing of practices. It encourages reflecting on how to harness new technologies and resources while rethinking support and assessment of learning.
The document discusses the 7Cs framework for learning design proposed by Gráinne Conole. It outlines characteristics of new media technologies and their implications for learning, teaching and research. Some key points include: new technologies allow for peer critiquing, user-generated content, and networked and personalized learning. However, their potential is not fully realized as existing pedagogies are often replicated without taking advantage of new opportunities. The 7Cs framework - conceptualize, create, communicate, consume, collaborate, contribute, and critique - provides a design-based approach that encourages reflective practices and sharing. It can help educators harness new technologies while rethinking design, support and assessment of learning.
This document outlines an agenda and objectives for a seminar on flipping the classroom for science teachers. The agenda covers topics such as the essentials of flipping, creating accountability, starting with online videos, and creating digital resources. Objectives include explaining how to get started with a flipped classroom, identifying effective apps and tools, and designing rigorous science lessons aligned with standards. The document also provides examples of video lessons, software for creating videos, and models for implementing a flipped classroom approach.
At NCSU, librarians have developed a curriculum which is being offered to the library community as the Data and Visualization Institute for Librarians, enabling participants to develop knowledge, skills, and confidence to communicate effectively with researchers. This presentation will discuss the skills liaison librarians must now learn to support faculty and students in these new areas.
- Provide faculty and staff with an understanding of what Pinterest is and how it may be used in the classroom environment.
- Share a step-by-step implementation plan on setting up a Pinterest board and starting the process
- Share case studies and examples of educational institutions using Pinterest within the classroom, along with best practices.
This model lesson will demonstrate how students can collect and share data and produce a digital report. Bring your own device to participate as a student or come observe all the action.
Slides presented (virtually) by Professor Rebecca Ferguson of The Open University at the Teach4Edu4 multiplier event held in Birmingham, UK, in January 2023. This presentation formed part of a larger workshop with multiple speakers from The Open University.
Muir Lake School, a part of Parkland School Division, is becoming a 1-to-1 BYOD learning community. The mission behind this initiative is "our students will innovate, collaborate, and be highly motivated about their learning". The goal is that every student will have access to a personal laptop in every class to use whenever it is the best tool for the learning activity. The initiative was piloted in grade 4 and grade 9 and will be expanding to all grades 4 through 9. This presentation outlines the "why" behind the initiative and first steps of Muir Lake School's journey. Google Doc Quick Link → bit.ly/MLS1to1
This keynote presentation will provide an overview of field-based learning - an active, inquiry-based teaching and learning strategy where teaching and learning is extended beyond the classroom/laboratory walls and where students are exposed to real-world teaching and learning settings in the broader community. In field-based learning, students learn by hands-on application of course content and though direct interaction with the environment rather than solely through textbooks and lectures.
This document discusses machine learning and the Jupyter project. It introduces Carol Willing, who is involved with Project Jupyter, Python Software Foundation, and CPython. It notes that Jupyter focuses on learning, usability, reproducibility, and collaboration. JupyterLab is introduced as Jupyter's next-generation web-based interface that can be tried on Kubernetes using tools like JupyterHub and Kubeflow. The document emphasizes using machine learning to solve problems through collaboration and open source tools.
STEAM Workshops with Binder and JupyterHub (Carol Willing)
This document summarizes Carol Willing's presentation on using Jupyter notebooks and Binder for STEAM workshops. It discusses how notebooks can engage learners through interactive experiences and help structure teaching content. Binder allows sharing notebook environments without installation. Examples are given of open educational resources using these tools, including for music, science, and teaching Python. Building an active learning community is emphasized through meetups, workshops and inviting new learners.
Learning Python: Tips from Cognitive Science, Jupyter, and Community (Carol Willing)
This document discusses tips for learning Python from a cognitive science perspective. It recommends using tools like Jupyter notebooks that engage learners through interactive experiences. Notebooks can be used to teach subjects like signal processing with examples of wearables. The Python community is praised for its support of new learners through meetups, workshops and groups like DjangoGirls and PyLadies. The document advocates choosing Python as it is designed for learning, sharing your work, and encouraging others in their Python journeys.
This document discusses how Jupyter Notebooks can be used to teach music concepts through interactive narratives. It provides examples of existing Jupyter content for music, including libraries for music analysis and generation. Tools like Jupyter Notebooks allow creating engaging content that combines code, prose, visualizations and audio to explore music concepts. The document encourages reusing and sharing Jupyter content through repositories and platforms like Binder to extend teaching resources.
This document introduces Zero to JupyterHub, which allows for scalable JupyterHub deployments on Kubernetes. It provides instructions on setting up a JupyterHub instance on Google Cloud using Kubernetes and Helm. The tutorial goals are to launch a JupyterHub on Google Cloud using free credits, understand allocated resources for users, perform basic debugging, and tear down the deployment. Key steps include signing up for Google Cloud, activating products, installing Kubernetes components like kubectl and Helm, adding the JupyterHub Helm chart repository, generating configuration files, and installing JupyterHub using Helm.
This document provides an agenda and overview for a JupyterHub tutorial. The tutorial will cover deploying JupyterHub, including installation, configuration, customization of authentication and spawning, and optimizations. Attendees will learn about using JupyterHub for students and researchers, and best practices for deployment. The tutorial is split into sessions on JupyterHub fundamentals, custom spawners like DockerSpawner, reference deployments, and the JupyterHub API. There will be a morning break and wrap up session at the end.
Python and Jupyter: Your Gateway for Learning (Carol Willing)
Carol Willing discusses how Python and Jupyter can serve as gateways for learning. She describes her own journey from beginner to core developer, highlighting resources like PyLadies workshops and conferences that helped her learn. Jupyter Notebook is introduced as an interactive computing environment for exploration, analysis and reproducible documents. Examples are given of how Jupyter Notebooks are used in education and scientific domains. The talk concludes by encouraging the audience to get involved and help others on their learning paths.
This document introduces Jupyter, an open-source web application that allows users to create and share documents that contain live code, equations, visualizations and narrative text. It summarizes a workshop that was held to teach participants how to use Jupyter and empower them to explore, create and change through coding. The workshop provided examples of how Jupyter can be used across different fields and encouraged participants to discover how they could apply Jupyter in their own work.
This document is a presentation by Carol Willing about mentoring. It tells a story from 1977-1983 about how Carol was mentored by Margaret Daniels Tyler, who encouraged her to pursue a PhD in electrical engineering when Carol was feeling broken, isolated and like a failure. The presentation discusses why mentoring matters for creating inclusive communities and solving real-world problems. It provides tips for how to mentor others, with the acronym "SLQAR" which stands for show up, listen, question, act, and recharge. The presentation aims to encourage the audience to mentor and support others.
JupyterHub - A "Thing Explainer" Overview (Carol Willing)
JupyterHub allows each user in a group to have their own Jupyter notebook server. It has three main parts: the hub, which manages authentication and spawns single-user notebook servers; a user database to store user information; and an authenticator to verify users' identities. When a user logs in, the hub's spawner creates a dedicated notebook server for that user. JupyterHub is useful for shared computing resources like classrooms, workshops, or research groups.
JupyterHub - A "Thing Explainer" Overview (Carol Willing)
JupyterHub allows each user to have their own Jupyter notebook server by using a central hub that handles authentication, spawns single-user notebook servers on demand, and proxies requests to the various user servers. The hub consists of a user database, authenticator to check user identities, and a spawner that creates the individual user servers. JupyterHub is useful for classes where students can do homework, workshops that require installation, or research groups that want to share computing resources through a centralized hub.
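To connect the pieces named above (hub, authenticator, spawner, user database), here is a minimal jupyterhub_config.py sketch; the classes shown are common choices used for illustration, not the only options, and DummyAuthenticator is for local testing only.

```python
# jupyterhub_config.py -- minimal sketch of wiring an authenticator and a spawner.
c = get_config()  # noqa: F821  (injected by JupyterHub when it loads this file)

# The authenticator decides who may log in.
c.JupyterHub.authenticator_class = "jupyterhub.auth.DummyAuthenticator"
c.DummyAuthenticator.password = "not-for-production"

# The spawner starts one single-user notebook server per authenticated user.
c.JupyterHub.spawner_class = "jupyterhub.spawner.LocalProcessSpawner"
c.Spawner.default_url = "/lab"   # land users in JupyterLab
c.Spawner.mem_limit = "2G"       # per-user cap, honoured by spawners that support it

# The hub keeps its user database in SQLite by default.
c.JupyterHub.db_url = "sqlite:///jupyterhub.sqlite"
```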
JupyterHub for Interactive Data Science Collaboration (Carol Willing)
This document discusses Project Jupyter, an open-source project that allows for interactive data science and scientific computing through Jupyter notebooks. It provides an overview of Jupyter notebooks and their use for "literate computing", highlighting examples in education, government, business, science, and collaboration. Key points include the ability of Jupyter notebooks to support exploratory data analysis, reproducibility, and narrative-driven communication of computational activities and results across many domains and audiences.
This document discusses using JupyterHub for user group meetings and workshops so that participants can "Learn All the Things" in a safe environment. It recommends trying JupyterHub for a future meeting or workshop because it lets data scientists, developers, and devops professionals run Jupyter notebooks without worrying about breaking their own systems or knowing where to start. It provides several links to JupyterHub resources, including tutorials, reference deployments, and the JupyterHub documentation.
This document provides an introduction to Git and GitHub. It explains that Git is a tool that allows users to have local control over source files, while GitHub is a service for sharing and collaborating on projects in the cloud. It then discusses the relationships between the local, origin, and upstream repositories. The document guides users through the basic workflow of forking a project, cloning it locally, and adding a remote. It also covers commands for making and sharing code changes like adding, committing, fetching, rebasing, and pushing files. The overall goal is to empower users to get started with version control and open source contribution.
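For readers who prefer to script that workflow from Python, the sketch below mirrors the fork/clone/remote steps using the GitPython library (pip install GitPython); the repository URLs and branch name are placeholders, not details taken from the talk.

    # A sketch of the fork -> clone -> upstream workflow described above,
    # using GitPython; URLs and branch names are placeholders.
    from git import Repo

    # Clone your fork (the "origin") and register the original project as "upstream".
    repo = Repo.clone_from("https://github.com/you/project.git", "project")
    repo.create_remote("upstream", "https://github.com/original/project.git")

    # Make a change, stage it, and commit locally.
    repo.index.add(["README.md"])
    repo.index.commit("Fix a typo in the README")

    # Stay current with upstream, then publish your commits to your fork.
    repo.remotes.upstream.fetch()
    repo.git.rebase("upstream/master")   # replay local commits on top of upstream
    repo.remotes.origin.push("master")   # push to your fork, ready for a pull request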
Python - The People's Programming LanguageCarol Willing
Keynote for PyCon Philippines 2016 in Cebu. This talk explores how Python and people partner to address complex real world issues. Project Jupyter and its notebooks are also shared.
This document provides guidance for contributing to the CPython project. It uses a musical metaphor of finding your groove. It recommends gathering the necessary tools like an editor and version control. It suggests checking out the CPython source code from its repository and learning about the code structure. It offers ways to listen and learn like watching Python videos. It encourages joining communities like mailing lists and sprints. It emphasizes that mistakes will happen but to keep trying new things. It provides examples of contributions like improving documentation and testing. The overall message is to enjoy the journey of contributing to the CPython project.
This document provides guidance on developing wearable technology prototypes while considering privacy and the user experience. It recommends prototyping with sensors for movement and activity, using sensor fusion to make prototyping easier, choosing software for easy development, deciding on a firmware or operating system, selecting appropriate enclosures, iterating based on hardware and software needs, and being mindful of privacy by providing users control over data collection and indicators of camera use. The document aims to help developers balance functionality and privacy concerns when creating wearable technology prototypes.
2014 01 23_pyladies_san diego python user groupCarol Willing
The San Diego Python User Group document outlines the group's outreach and education efforts, including hack evenings, a Meetup group with 74 members, Google Hangouts to reach a large geographic area, and introductory education sessions on topics like Python, Pelican, Git, and data science libraries. It also discusses the group's K-12 education support through conferences and events, as well as upcoming initiatives like a new PyLadies logo, a website, and more hack nights.
2014 01 23_pycon_san diego python user group meetingCarol Willing
This document promotes upcoming Python conferences in 2014, including PyCon 2014 in Montreal in April. It highlights the learning, networking, and development opportunities at PyCon, considered Python's premier conference, and also lists several other regional events closer to San Diego throughout the year, such as SCALE12x in February and the Geek Girl Tech Conference in June. It encourages attending PyCon for the opportunities it provides, but notes that local events can be enjoyed without travel.
18. Motivate 1
• Work with student interests
  https://nbviewer.jupyter.org/
• Wow with possibilities
  https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
• Get them started with tmpnb and binder
  https://try.jupyter.org
  http://mybinder.org/
(Screenshots: Gallery of Interesting Notebooks, nbviewer, Project Jupyter community)
19. Motivate 2
• Start with a proven curriculum
  http://pyvideo.org/pycon-us-2013/a-hands-on-introduction-to-python-for-beginning-p.html
• Hands on to engage students
• Takeaway notebooks reduce student stress
  https://github.com/pythonsd/intro-to-python
(Screenshot: Intro to Python, San Diego Python)
20. Motivate 3
• Exploration and experimentation
  http://pyvideo.org/scipy-2016/labs-in-the-wild-teaching-signal-processing-using-wearables-jupyter-notebooks-scipy-2016.html
• Physical media with wearables and electronics
• Real world, self-directed projects
(Screenshot: Teaching signal processing using Wearables and Jupyter Notebooks, Demba Ba)
21. Develop mastery 1
• Feedback and communication with students using nbgrader
  http://kristenthyng.com/blog/2016/09/07/jupyterhub+nbgrader/
• Progression to complex examples and tasks
  https://github.com/kthyng/python4geosciences
(Screenshot: Python for Geosciences, Kristen Thyng)
22. Develop mastery 2
• Excellent resource on using tmpnb and JupyterHub for teaching
  http://jupyter.rocks/
  https://github.com/tanyaschlusser/Jupyter-with-R
(Screenshot: Using Jupyter notebooks with R in the classroom, Tanya Schlusser)
23. Develop mastery 3
Cal Poly SLO, Data Science 301, Brian Granger
• Intensive data science course for undergraduates
  https://github.com/calpolydatascience/data301
• Ansible deployment
  https://github.com/jupyterhub/jupyterhub-deploy-teaching
• Research project and student interns
  http://www.calpolynews.calpoly.edu/news_releases/2015/July/jupyter.html
24. Apply knowledge 1
Berkeley Data Science: Data8, UC Berkeley
  http://denero.org/data-8-in-spring-2017.html
  https://github.com/data-8/jupyterhub-k8s
  http://data8.org/
  http://data.berkeley.edu/
  http://data.berkeley.edu/about/videos
• Campus-wide curriculum
• Cross-discipline
• Kubernetes deployment of JupyterHub
• Zero to JupyterHub with Kubernetes
  https://zero-to-jupyterhub.readthedocs.io
26. Next steps
• Join the Jupyter in Education community
• Try no-installation-needed solutions
• Try tmpnb with a workshop
• Offer a course with JupyterHub
• Scale your curriculum to other courses
30. Questions?
• Steering Council, Project Jupyter
• Software Engineer, Cal Poly SLO
• Director, Python Software Foundation
• Geek in Residence, Fab Lab San Diego
Carol Willing
@willingcarol
31. Attributions and recognition
• Kristen Thyng
• San Diego Python
• UC Berkeley Data Science
• Cal Poly SLO
• Tanya Schlusser
• Demba Ba
• Project Jupyter team and community