The document describes a dataset containing on-time performance records for 95% of commercial flights in the United States. It includes over 30 fields of information for each flight such as airline, departure/arrival times, delays, distances, and causes of delays. An example record from the dataset is shown containing values for many of the fields.
Agile Data Science 2.0 covers the theory and practice of applying agile methods to data science: applied analytics research. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" and enable agility by building applications that describe the applied research process itself. Then we show how to use 'big data' tools to iteratively build, deploy and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to build applications from conception through development, deployment and iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, we show you how to effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, ElasticSearch and Apache Airflow.
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. The methodology seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. The stack is one example of a stack that meets the requirements: highly scalable and efficient for application developers and data engineers alike. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, ElasticSearch, Apache Parquet, Python/Flask and jQuery. This talk covers the full lifecycle of large data application development and shows how to use lessons from agile software engineering to apply data science with this full stack to build better analytics applications. The system starts with plumbing, moves on to data tables, charts and search, then interactive reports, and builds toward predictions in both batch and realtime (defining the role of each), the deployment of predictive systems, and how to iteratively improve predictions that prove valuable.
Running Intelligent Applications inside a Database: Deep Learning with Python... - Miguel González-Fierro
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must pull the data out of the database to execute a routine; if the data is large, this data movement is inefficient. Stored procedures tried to solve this issue in the past by allowing simple functions to be computed inside the database, but only simple routines can be executed.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft's Cognitive Toolkit, also known as CNTK. We used transfer learning between the ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a Convolutional Neural Network pretrained on ImageNet is used on the lung cancer dataset to generate features. Once the features are computed, a boosted tree is applied to predict whether or not the patient has cancer.
All of this is computed inside the database, so data movement is minimized. We are even able to execute the algorithm using the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than 1 hour, compared with up to 32 hours on a CPU. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction for a patient.
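The talk's pipeline runs CNTK inside SQL Server; as a rough illustration of the second stage only, here is a minimal scikit-learn sketch that fits a boosted tree on already-extracted CNN features, using synthetic stand-in data rather than the actual lung scans.
# Minimal sketch of the boosted-tree stage, assuming CNN features were already extracted.
# The feature matrix and labels below are synthetic stand-ins, not the CNTK pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
cnn_features = rng.rand(200, 512)     # pretend 512-dimensional CNN features per scan
has_cancer = rng.randint(0, 2, 200)   # pretend labels

X_train, X_test, y_train, y_test = train_test_split(
  cnn_features, has_cancer, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))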
Networks All Around Us: Extracting networks from your problem domain - Russell Jurney
Network analytics is increasingly used to create machine intelligence that automates the world around us. But what is a network, and how do you analyze one? More directly: how do I find and analyze networks in my dataset? This talk goes over a number of examples of practical network analytics to give viewers a playbook for applied social network analysis and network analytics.
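As a small illustration of the kind of analysis the talk describes (not code from the talk itself), here is a sketch that builds a graph from a hypothetical edge list with networkx and ranks nodes by degree centrality.
# A tiny illustrative sketch: build a network from extracted relationships
# and rank the nodes by degree centrality. The edge list is hypothetical.
import networkx as nx

edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]

G = nx.Graph()
G.add_edges_from(edges)

centrality = nx.degree_centrality(G)
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
  print(node, round(score, 2))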
Applied Machine Learning using H2O, Python and R Workshop - Avkash Chauhan
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Basic knowledge of R/python and general ML concepts
Note: This is a bring-your-own-laptop workshop. Make sure you bring your laptop in order to participate.
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding a linear regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
Data science with Windows Azure - A Brief Introduction - Adnan Masood
Data Science with Windows Azure is an introduction to the HDInsight and Hadoop offerings from Microsoft's cloud-based machine learning and big data platform. This was presented at the Microsoft Data Science Group – Tampa Analytics Professionals.
NoSQL: The first New Jakarta EE Specification (DWX 2019) - Werner Keil
Jakarta EE NoSQL is a framework and collection of tools that make integration between Java applications and NoSQL quick and easy—for developers as well as vendors. The API is easy to implement, so NoSQL vendors can quickly implement, test, and become compliant by themselves. And with its low learning curve and just a minimal set of artifacts, Java developers can start coding without having to worry about the complexity of specific NoSQL databases instead of their core aspects (such as graph or document properties). Built with functional programming in mind, it leverages all the features of Java 8 and above.
This session covers how the API is structured, how it relates to the multiple NoSQL database types, and how you can get started and involved in this open source technology and help the first new Jakarta EE specification evolve.
Eclipse science group presentation given at Eclipse Converge and Devoxx 2017 in California. These slides give an overview of projects in the Eclipse Science working group in 2017.
How Concur uses Big Data to get you to Tableau Conference On Time - Denny Lee
This is my presentation from Tableau Conference #Data14 as the Cloudera Customer Showcase - How Concur uses Big Data to get you to Tableau Conference On Time. We discuss Hadoop, Hive, Impala, and Spark within the context of Consolidation, Visualization, Insight, and Recommendation.
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
Best Practices for Building and Deploying Data Pipelines in Apache Spark - Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
Data Seeding via Parameterized API Requests - RapidValue
A quick guide on how to seed data via parameterized API requests. Parameterization is very important for automation testing: it lets you iterate over input data with multiple data sets, making your scripts reusable and maintainable. In a few scenarios you can still manage with a hard-coded request, but that approach will not work when a sheer number of combinations must be validated. By implementing the right solution, you can keep your code base and test data at an ideal size and still get the benefits of optimal coverage.
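As a hedged sketch of the idea (not RapidValue's code), here is how parameterized data seeding might look in Python with the requests library; the endpoint and payload fields are hypothetical.
# Sketch of data seeding via parameterized API requests; the URL and fields are hypothetical.
import requests

BASE_URL = "https://example.com/api/users"  # hypothetical endpoint

# Multiple data sets to iterate over instead of one hard-coded request body
test_users = [
  {"name": "Ada", "role": "admin"},
  {"name": "Bob", "role": "viewer"},
  {"name": "Cleo", "role": "editor"},
]

for user in test_users:
  # Each iteration seeds one record; the payload is parameterized per data set
  response = requests.post(BASE_URL, json=user, timeout=10)
  response.raise_for_status()
  print("Seeded", user["name"], "->", response.status_code)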
The primary focus of this presentation is approaching the migration of a large, legacy data store into a new schema built with Django. Includes discussion of how to structure a migration script so that it will run efficiently and scale. Learn how to recognize and evaluate trouble spots.
Also discusses some general tips and tricks for working with data and establishing a productive workflow.
OLAP on the Cloud with Azure Databricks and Azure Synapse - AtScale
This presentation was part of the 2020 Global Summer Azure Data Fest. It explains how Cloud OLAP helps you analyze large amounts of data on Azure Databricks, Azure Synapse and other data platforms without moving it, and shows how to leverage AtScale's Cloud OLAP to perform multidimensional analysis – and derive business insights – on data sets from multiple providers, with no data prep or data engineering required.
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum... - DataStax
Leveraging your operational data for advanced and predictive analytics enables deeper insights and greater value for cloud applications. DSE Analytics is a complete platform for Operational Analytics, including data ingestion, stream processing, batch analysis, and machine learning.
In this talk we will provide an overview of DSE Analytics as it applies to data science tools and techniques, and demonstrate these via real world use cases and examples.
Brian Hess
Rob Murphy
Rocco Varela
About the Speakers
Brian Hess Senior Product Manager, Analytics, DataStax
Brian has been in the analytics space for over 15 years ranging from government to data mining applied research to analytics in enterprise data warehousing and NoSQL engines, in roles ranging from Cryptologic Mathematician to Director of Advanced Analytics to Senior Product Manager. In all these roles he has pushed data analytics and processing to massive scales in order to solve problems that were previously unsolvable.
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411 - Mark Tabladillo
If you have a SQL Server license (Standard or higher) then you already have the ability to start data mining. In this new presentation, you will see how to scale up data mining from the free Excel 2013 add-in to production use. Aimed at beginning to intermediate data miners, this presentation will show how mining models move from development to production. We will use SQL Server 2014 tools including SSMS, SSIS, and SSDT.
Paketo Buildpacks: the best way to build OCI images? DevopsDa... - Anthony Dahanne
Buildpacks have been around for more than 10 years! At first they were used to detect and build an application before deploying it on certain PaaS platforms. Then, with their latest generation, the Cloud Native Buildpacks (a CNCF incubating project), we were able to create Docker (OCI) images. Are they a good alternative to the Dockerfile? What are the Paketo buildpacks? Which communities support them, and how?
Come find out in this ignite session.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
How Recreation Management Software Can Streamline Your Operations - wottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
First Steps with Globus Compute Multi-User Endpoints - Globus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Enhancing Project Management Efficiency: Leveraging AI Tools like ChatGPT - Jay Das
With the advent of artificial intelligence (AI) tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT and Bard, organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... - Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Globus Compute with IRI Workflows - GlobusWorld 2024 - Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Large Language Models and the End of Programming - Matt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
How to Position Your Globus Data Portal for Success: Ten Good Practices - Globus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam - takuyayamamoto1800
In these slides, we show a simulation example and how to compile the solver.
With this solver, the Helmholtz equation can be solved by helmholtzFoam. The Helmholtz equation with uniformly dispersed bubbles can also be simulated with helmholtzBubbleFoam.
Developing Distributed High-performance Computing Capabilities of an Open Sci... - Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Quarkus Hidden and Forbidden Extensions - Max Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Prosigns: Transforming Business with Tailored Technology Solutions - Prosigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
1. Building Full Stack Data Analytics Applications with Kafka and Spark
Agile Data Science 2.0
http://bit.ly/agile_data_slides_4
2. Agile Data Science 2.0
Russell Jurney
2
Skills: Data Engineer 85%, Data Scientist 85%, Visualization Software Engineer 85%, Writer 85%, Teacher 50%
Russell Jurney is a veteran data scientist and thought leader. He coined the term Agile Data Science in the book of that name from O'Reilly in 2012, which outlines the first agile development methodology for data science. Russell has constructed numerous full-stack analytics products over the past ten years and now works with clients helping them extract value from their data assets.
Russell Jurney, Principal Consultant at Data Syndrome
Data Syndrome, LLC
Email: russell.jurney@gmail.com
Web: datasyndrome.com
3.
Product Consulting
We build analytics products and systems consisting of big data viz, predictions, recommendations, reports and search.
Corporate Training
We offer training courses for data scientists, data engineers and data science teams.
Video Training
We offer video training courses that rapidly acclimate you to a technology and technique.
4. Agile Data Science 2.0 4
What makes data science “agile data science”?
Theory
5. Agile Data Science 2.0 5
Yes. Building applications is a fundamental skill for today’s data scientist.
Data Products or Data Science?
7. Agile Data Science 2.0 7
If someone else has to start over and rebuild it, it ain’t agile.
Big Data or Data Science?
8. Agile Data Science 2.0 8
Goal of Methodology
The goal of agile data science in <140 characters: to document and guide exploratory data analysis to discover and follow the critical path to a compelling product.
9. Agile Data Science 2.0 9
In analytics, the end-goal moves or is complex in nature, so we model as a network of tasks rather than as a strictly linear process.
Critical Path
10. Agile Data Science 2.0
Agile Data Science Manifesto
10
Seven Principles for Agile Data Science
1. Iterate, iterate, iterate: tables, charts, reports, predictions
2. Ship intermediate output. Even failed experiments have output
3. Prototype experiments over implementing tasks
4. Integrate the tyrannical opinion of data in product management
5. Climb up and down the data-value pyramid as we work
6. Discover and pursue the critical path to a killer product
7. Get Meta. Describe the process, not just the end-state
11. Agile Data Science 2.0 11
People will pay more for the things towards the top, but you need the things on the bottom to have the things above. They are foundational. See: Maslow's Theory of Needs.
Data Value Pyramid
13. Agile Data Science 2.0
Agile Data Science 2.0 Stack
13
Example of a high productivity stack for "big" data applications:
Apache Spark - Batch and Realtime
Apache Kafka - Realtime Queue
MongoDB - Document Store
Flask - Simple Web App
ElasticSearch - Search
d3.js - Visualization
14. Agile Data Science 2.0
Flow of Data Processing
14
Tools and processes in collecting, refining, publishing and decorating data
{“hello”: “world”}
16. Agile Data Science 2.0 16
SQL or dataflow programming?
Programming Models
17. Agile Data Science 2.0 17
Describing what you want and letting the planner figure out how
SQL
SELECT associations2.object_id,
associations2.term_id, associations2.cat_ID,
associations2.term_taxonomy_id
FROM (SELECT objects_tags.object_id,
objects_tags.term_id, wp_cb_tags2cats.cat_ID,
categories.term_taxonomy_id
FROM (SELECT
wp_term_relationships.object_id,
wp_term_taxonomy.term_id,
wp_term_taxonomy.term_taxonomy_id
FROM wp_term_relationships
LEFT JOIN wp_term_taxonomy ON
wp_term_relationships.term_taxonomy_id =
wp_term_taxonomy.term_taxonomy_id
ORDER BY object_id ASC, term_id ASC)
AS objects_tags
LEFT JOIN wp_cb_tags2cats ON
objects_tags.term_id = wp_cb_tags2cats.tag_ID
LEFT JOIN (SELECT
wp_term_relationships.object_id,
wp_term_taxonomy.term_id as cat_ID,
wp_term_taxonomy.term_taxonomy_id
FROM wp_term_relationships
LEFT JOIN wp_term_taxonomy ON
wp_term_relationships.term_taxonomy_id =
wp_term_taxonomy.term_taxonomy_id
WHERE wp_term_taxonomy.taxonomy =
'category'
GROUP BY object_id, cat_ID,
term_taxonomy_id
ORDER BY object_id, cat_ID,
term_taxonomy_id)
AS categories on wp_cb_tags2cats.cat_ID
= categories.term_id
WHERE objects_tags.term_id =
wp_cb_tags2cats.tag_ID
GROUP BY object_id, term_id, cat_ID,
term_taxonomy_id
ORDER BY object_id ASC, term_id ASC, cat_ID
ASC)
AS associations2
LEFT JOIN categories ON associations2.object_id
= categories.object_id
WHERE associations2.cat_ID <> categories.cat_ID
GROUP BY object_id, term_id, cat_ID,
term_taxonomy_id
ORDER BY object_id, term_id, cat_ID,
term_taxonomy_id
18. Agile Data Science 2.0 18
Flowing data through operations to effect change
Dataflow Programming
19. Agile Data Science 2.0 19
The best of both worlds!
SQL AND Dataflow Programming
# Flights that were late arriving...
late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
total_late_arrivals = late_arrivals.count()

# Flights that left late but made up time to arrive on time...
on_time_heros = on_time_dataframe.filter(
  (on_time_dataframe.DepDelayMinutes > 0)
  &
  (on_time_dataframe.ArrDelayMinutes <= 0)
)
total_on_time_heros = on_time_heros.count()

# Get the percentage of flights that are late, rounded to 1 decimal place
# (total_flights and total_late_departures are computed earlier in the script)
pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)

print("Total flights: {:,}".format(total_flights))
print("Late departures: {:,}".format(total_late_departures))
print("Late arrivals: {:,}".format(total_late_arrivals))
print("Recoveries: {:,}".format(total_on_time_heros))
print("Percentage Late: {}%".format(pct_late))
# Why are flights late? Let's look at some delayed flights and the delay causes
late_flights = spark.sql("""
SELECT
ArrDelayMinutes,
WeatherDelay,
CarrierDelay,
NASDelay,
SecurityDelay,
LateAircraftDelay
FROM
on_time_performance
WHERE
WeatherDelay IS NOT NULL
OR
CarrierDelay IS NOT NULL
OR
NASDelay IS NOT NULL
OR
SecurityDelay IS NOT NULL
OR
LateAircraftDelay IS NOT NULL
ORDER BY
FlightDate
""")
late_flights.sample(False, 0.01).show()
# Calculate the percentage contribution to delay for each source
total_delays = spark.sql("""
SELECT
ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_weather_delay,
ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_carrier_delay,
ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_security_delay,
ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_late_aircraft_delay
FROM on_time_performance
""")
total_delays.show()
# Generate a histogram of the weather and carrier delays
weather_delay_histogram = on_time_dataframe \
  .select("WeatherDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram(10)
print("{}\n{}".format(weather_delay_histogram[0], weather_delay_histogram[1]))

# Eyeball the first to define our buckets
weather_delay_histogram = on_time_dataframe \
  .select("WeatherDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram([1, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)
# Transform the data into something easily consumed by d3
record = {'key': 1, 'data': []}
for label, count in zip(weather_delay_histogram[0], weather_delay_histogram[1]):
  record['data'].append(
    {
      'label': label,
      'count': count
    }
  )
# Save to Mongo directly, since this is a Tuple not a dataframe or RDD
from pymongo import MongoClient
client = MongoClient()
client.relato.weather_delay_histogram.insert_one(record)
21. Data Syndrome: Agile Data Science 2.0
Collect and Serialize Events in JSON
I never regret using JSON
21
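A minimal sketch of what collecting and serializing events in JSON can look like: one JSON object per line (JSON Lines), which Spark reads natively. The event fields here are illustrative, not the book's exact schema.
# Serialize events as JSON, one object per line (JSON Lines); fields are illustrative.
import json

events = [
  {"Carrier": "AA", "FlightDate": "2015-01-01", "DepDelay": 14.0},
  {"Carrier": "DL", "FlightDate": "2015-01-01", "DepDelay": -3.0},
]

with open("events.jsonl", "w") as f:
  for event in events:
    f.write(json.dumps(event) + "\n")

# Reading the events back is just as simple
with open("events.jsonl") as f:
  for line in f:
    print(json.loads(line))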
22. Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
22http://www.transtats.bts.gov/Fields.asp?table_id=236
23. Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
23
"Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum",
"OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips",
"OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState",
"DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups",
"DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15",
"ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime",
"Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay",
"FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay",
"DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime",
"Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime",
"Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn",
"Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID",
"Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID",
"Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum"
24. Data Syndrome: Agile Data Science 2.0
openflights.org Database
Airports, Airlines, Routes
24
25. Data Syndrome: Agile Data Science 2.0
Scraping the FAA Registry
Airplane Data by Tail Number
25
26. Data Syndrome: Agile Data Science 2.0
Wikipedia Airlines Entries
Descriptions of Airlines
26
27. Data Syndrome: Agile Data Science 2.0
National Centers for Environmental Information
Historical Weather Observations
27
28. Agile Data Science 2.0 28
How to store it and how to fetch it…
Data Structure and Access Patterns
29. Data Syndrome: Agile Data Science 2.0
First Order Form
Storing groups of documents in key/value stores
29
Key/Value and some Document Stores
Partition Tolerance
No Single Master
30. Data Syndrome: Agile Data Science 2.0
First Order Form
Storing groups of documents in key/value stores
30
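A hedged sketch of first order form: every record for an entity is stored together under one key, so the access pattern becomes a single key lookup. A plain Python dict stands in for the key/value store; the flight fields are illustrative.
# First order form sketch: one key -> one group of documents; a dict stands in for the store.
kv_store = {}

flights_for_tail = [
  {"FlightDate": "2015-12-01", "Origin": "SFO", "Dest": "JFK"},
  {"FlightDate": "2015-12-16", "Origin": "JFK", "Dest": "SFO"},
]

kv_store["flights:N16954"] = flights_for_tail

# Fetching the group back is a single key lookup, no query planner involved
for flight in kv_store["flights:N16954"]:
  print(flight["FlightDate"], flight["Origin"], "->", flight["Dest"])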
31. Data Syndrome: Agile Data Science 2.0
Second Order Form
Using column stores and compound keys to enable range scans
31
Range Scans for Fun and Profit
One or more Masters
32. Data Syndrome: Agile Data Science 2.0
Second Order Form
Using column stores and compound keys to enable range scans
32
# Achieve many access patterns via key composition:
Requirement: SELECT FLIGHTS WHERE FlightDate < X AND FlightDate > Y
Compose Key: TailNum + FlightDate, ex. N16954-2015-12-16
Range Scan: FROM: N16954-2015-12-01
TO: N16954-2015-01-01
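A pure-Python sketch of the same idea: compose TailNum and FlightDate into one key, keep the keys sorted, and a date range for one tail number becomes a contiguous slice, which is what a column store does for you on disk. The rows are illustrative.
# Compound-key range scan sketch; a real column store does this on sorted keys on disk.
import bisect

rows = {
  "N16954-2015-11-28": {"Dest": "ORD"},
  "N16954-2015-12-01": {"Dest": "JFK"},
  "N16954-2015-12-16": {"Dest": "SFO"},
  "N16954-2016-01-05": {"Dest": "LAX"},
}
sorted_keys = sorted(rows)

start, stop = "N16954-2015-12-01", "N16954-2016-01-01"
lo = bisect.bisect_left(sorted_keys, start)
hi = bisect.bisect_right(sorted_keys, stop)

for key in sorted_keys[lo:hi]:
  print(key, rows[key])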
33. Data Syndrome: Agile Data Science 2.0
Third Order Form
Using databases with B-Trees to query and analyze raw records
33
SELECT COUNT(*), SUM(*)
…
FROM …
GROUP BY …
WHERE …
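A concrete instance of the kind of aggregate query the slide sketches, written against the on_time_performance table registered elsewhere in this deck (assumes an active SparkSession).
# Example aggregate over raw records, assuming the on_time_performance table is registered.
total_delay_by_carrier = spark.sql("""
  SELECT
    Carrier,
    COUNT(*) AS total_flights,
    SUM(ArrDelayMinutes) AS total_arrival_delay
  FROM on_time_performance
  WHERE ArrDelayMinutes IS NOT NULL
  GROUP BY Carrier
""")
total_delay_by_carrier.show()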
34. Agile Data Science 2.0 34
Working our way up the data value pyramid
Climbing the Stack
35. Agile Data Science 2.0 35
Starting by “plumbing” the system from end to end
Plumbing
36. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records
Plumbing our master records through to the web
36
37. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to MongoDB
Plumbing our master records through to the web
37
import pymongo
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Convert to RDD of dicts and save to MongoDB
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
38. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to ElasticSearch
Plumbing our master records through to the web
38
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Save the DataFrame to Elasticsearch
on_time_dataframe.write.format("org.elasticsearch.spark.sql") \
  .option("es.resource", "agile_data_science/on_time_performance") \
  .option("es.batch.size.entries", "100") \
  .mode("overwrite") \
  .save()
39. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
39
from flask import Flask, render_template, request
from pymongo import MongoClient
from bson import json_util
# Set up Flask and Mongo
app = Flask(__name__)
client = MongoClient()
# Controller: Fetch a flight and display it
@app.route("/on_time_performance")
def on_time_performance():
  carrier = request.args.get('Carrier')
  flight_date = request.args.get('FlightDate')
  flight_num = request.args.get('FlightNum')
  flight = client.agile_data_science.on_time_performance.find_one({
    'Carrier': carrier,
    'FlightDate': flight_date,
    'FlightNum': int(flight_num)
  })
  return json_util.dumps(flight)

if __name__ == "__main__":
  app.run(debug=True)
40. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
40
41. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
41
43. Data Syndrome: Agile Data Science 2.0
Tables in PySpark
Back end development in PySpark
43
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Use SQL to look at the total flights by month across 2015
on_time_dataframe.registerTempTable("on_time_dataframe")
total_flights_by_month = spark.sql(
"""SELECT Month, Year, COUNT(*) AS total_flights
FROM on_time_dataframe
GROUP BY Year, Month
ORDER BY Year, Month"""
)
# This map/asDict trick makes the rows print a little prettier. It is optional.
flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
flights_chart_data.collect()
# Save chart to MongoDB
import pymongo_spark
pymongo_spark.activate()
flights_chart_data.saveToMongoDB(
'mongodb://localhost:27017/agile_data_science.flights_by_month'
)
44. Data Syndrome: Agile Data Science 2.0
Tables in Flask and Jinja2
Front end development in Flask: controller and template
44
# Controller: Fetch a flight table
@app.route("/total_flights")
def total_flights():
  total_flights = client.agile_data_science.flights_by_month.find({},
    sort = [
      ('Year', 1),
      ('Month', 1)
    ])
  return render_template('total_flights.html', total_flights=total_flights)
{% extends "layout.html" %}
{% block body %}
<div>
<p class="lead">Total Flights by Month</p>
<table class="table table-condensed table-striped" style="width: 200px;">
<thead>
<th>Month</th>
<th>Total Flights</th>
</thead>
<tbody>
{% for month in total_flights %}
<tr>
<td>{{month.Month}}</td>
<td>{{month.total_flights}}</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
{% endblock %}
48. Agile Data Science 2.0 48
Exploring your data through interaction
Reports
49. Data Syndrome: Agile Data Science 2.0
Creating Interactive Ontologies from Semi-Structured Data
Extracting and visualizing entities
49
50. Data Syndrome: Agile Data Science 2.0
Home Page
Extracting and decorating entities
50
51. Data Syndrome: Agile Data Science 2.0
Airline Entity
Extracting and decorating entities
51
52. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 1.0
Describing entities in aggregate
52
53. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 2.0
Describing entities in aggregate
53
54. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 3.0
Describing entities in aggregate
54
55. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 4.0
Describing entities in aggregate
55
56. Agile Data Science 2.0 56
Predicting the future for fun and profit
Predictions
57. Data Syndrome: Agile Data Science 2.0
Back End Design
Deep Storage and Spark vs Kafka and Spark Streaming
57
Batch: Historical Data -> Train Model
Realtime: Realtime Data -> Apply Model
58. Data Syndrome: Agile Data Science 2.0 58
jQuery in the web client submits a form to create the prediction request, and then polls another URL every few seconds until the prediction is ready. The request generates a Kafka event, which a Spark Streaming worker processes by applying the model we trained in batch. Having done so, it inserts a record for the prediction in MongoDB, where the Flask app sends it to the web client the next time it polls the server.
Front End Design
/flights/delays/predict/classify_realtime/
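A hedged sketch of the submit-and-poll endpoints implied above; in the full system the submit handler emits a Kafka event and a Spark Streaming job writes the prediction to MongoDB, which are represented here only as comments. Paths and field names are illustrative, not the book's exact code.
# Sketch of the submit-and-poll pattern; Kafka and Spark Streaming are stubbed as comments.
import uuid
from flask import Flask, request, jsonify
from pymongo import MongoClient

app = Flask(__name__)
client = MongoClient()

@app.route("/flights/delays/predict/classify_realtime", methods=["POST"])
def submit_prediction_request():
  prediction_id = str(uuid.uuid4())
  prediction_request = request.form.to_dict()
  prediction_request["id"] = prediction_id
  # In the real pipeline: emit prediction_request to a Kafka topic here,
  # where a Spark Streaming worker applies the model trained in batch.
  return jsonify({"id": prediction_id, "status": "WAIT"})

@app.route("/flights/delays/predict/classify_realtime/response/<prediction_id>")
def poll_prediction_response(prediction_id):
  # The client polls this URL every few seconds until the streaming job
  # has inserted a prediction document into MongoDB.
  prediction = client.agile_data_science.predictions.find_one({"id": prediction_id})
  if prediction is None:
    return jsonify({"status": "WAIT"})
  return jsonify({"status": "OK", "prediction": prediction["Prediction"]})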
59. Data Syndrome: Agile Data Science 2.0
User Interface
Where the user submits prediction requests
59
60. Data Syndrome: Agile Data Science 2.0
String Vectorization
From properties of items to vector format
60
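A minimal sketch of string vectorization with Spark ML, the same idea the full training script below applies to Carrier, Origin, Dest and Route; the tiny DataFrame is illustrative and assumes an active SparkSession.
# String vectorization sketch: index a string column, then assemble a feature vector.
from pyspark.ml.feature import StringIndexer, VectorAssembler

df = spark.createDataFrame(
  [("AA", 14.0), ("DL", -3.0), ("AA", 5.0)],
  ["Carrier", "DepDelay"]
)

# Turn the string category into a numeric index
indexer = StringIndexer(inputCol="Carrier", outputCol="Carrier_index")
indexed = indexer.fit(df).transform(df)

# Combine numeric fields and the index into one feature vector
assembler = VectorAssembler(
  inputCols=["DepDelay", "Carrier_index"],
  outputCol="Features_vec"
)
assembler.transform(indexed).show()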
61. Data Syndrome: Agile Data Science 2.0 61
The scikit-learn version was 166 lines. Spark MLlib is very powerful!
http://bit.ly/train_model_spark
190 Line Model
#!/usr/bin/env python
import sys, os, re
# Pass date and base path to main() from airflow
# (in the original script, everything below runs inside main(base_path);
# the slide layout flattens the indentation)
def main(base_path):
  # Default to "."
  try:
    base_path
  except NameError:
    base_path = "."
  if not base_path:
    base_path = "."
APP_NAME = "train_spark_mllib_model.py"
# If there is no SparkSession, create the environment
try:
  sc and spark
except NameError as e:
  import findspark
  findspark.init()
  import pyspark
  import pyspark.sql
  sc = pyspark.SparkContext()
  spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
#
# {
# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
# }
#
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField
from pyspark.sql.functions import udf
schema = StructType([
StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
StructField("Carrier", StringType(), True), # "Carrier":"WN"
StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
StructField("Dest", StringType(), True), # "Dest":"SAN"
StructField("Distance", DoubleType(), True), # "Distance":368.0
StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
StructField("Origin", StringType(), True), # "Origin":"TUS"
])
input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
base_path
)
features = spark.read.json(input_path, schema=schema)
features.first()
#
# Check for nulls in features before using Spark ML
#
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
print(list(cols_with_nulls))
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.show(6)
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into early, on-time, slightly late and very late (0, 1, 2, 3)
#
from pyspark.ml.feature import Bucketizer
# Setup the Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
arrival_bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
# Save the bucketizer
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
# Apply the bucketizer
ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
#
# Extract features tools in with pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn category fields into indexes
for column in ["Carrier", "Origin", "Dest", "Route"]:
string_indexer = StringIndexer(
inputCol=column,
outputCol=column + "_index"
)
string_indexer_model = string_indexer.fit(ml_bucketized_features)
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
# Drop the original column
ml_bucketized_features = ml_bucketized_features.drop(column)
# Save the pipeline model
string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
base_path,
column
)
string_indexer_model.write().overwrite().save(string_indexer_output_path)
# Combine continuous, numeric fields with indexes of nominal ones
# ...into one feature vector
numeric_columns = [
"DepDelay", "Distance",
"DayOfMonth", "DayOfWeek",
"DayOfYear"]
index_columns = ["Carrier_index", "Origin_index",
"Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
inputCols=numeric_columns + index_columns,
outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Save the numeric vector assembler
vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
vector_assembler.write().overwrite().save(vector_assembler_path)
# Drop the index columns
for column in index_columns:
final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
maxMemoryInMB=1024
)
model = rfc.fit(final_vectorized_features)
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
# Evaluate model using test data
predictions = model.transform(final_vectorized_features)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
predictionCol="Prediction",
labelCol="ArrDelayBucket",
metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print("Accuracy = {}".format(accuracy))
# Check the distribution of predictions
predictions.groupBy("Prediction").count().show()
# Check a sample
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
if __name__ == "__main__":
main(sys.argv[1])
62. Data Syndrome: Agile Data Science 2.0
Initializing the Environment
Setting up the environment…
62
#!/usr/bin/env python
import sys, os, re

# Pass date and base path to main() from airflow
def main(base_path):
    # Default to "."
    try: base_path
    except NameError: base_path = "."
    if not base_path:
        base_path = "."

    APP_NAME = "train_spark_mllib_model.py"

    # If there is no SparkSession, create the environment
    try:
        sc and spark
    except NameError as e:
        import findspark
        findspark.init()
        import pyspark
        import pyspark.sql

        sc = pyspark.SparkContext()
        spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
64. Data Syndrome: Agile Data Science 2.0
Checking for Nulls
Checking the data for null values that would crash Spark MLlib
64
#
# Check for nulls in features before using Spark ML
#
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
print(list(cols_with_nulls))
65. Data Syndrome: Agile Data Science 2.0
Adding a Feature
Using DataFrame.withColumn to add a Route feature to the data…
65
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.show(6)
66. Data Syndrome: Agile Data Science 2.0
Bucketizing the Prediction Column
Using Bucketizer to convert a continuous variable to a nominal one…
66
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into early, on-time, slightly late and very late (0, 1, 2, 3)
#
from pyspark.ml.feature import Bucketizer
# Setup the Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
arrival_bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
# Save the bucketizer
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
# Apply the bucketizer
ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
67. Data Syndrome: Agile Data Science 2.0
StringIndexing the String Columns
Using StringIndexer to convert nominal fields to numeric ones…
67
#
# Extract features tools in with pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn category fields into indexes
for column in ["Carrier", "Origin", "Dest", "Route"]:
    string_indexer = StringIndexer(
        inputCol=column,
        outputCol=column + "_index"
    )
    string_indexer_model = string_indexer.fit(ml_bucketized_features)
    ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

    # Drop the original column
    ml_bucketized_features = ml_bucketized_features.drop(column)

    # Save the pipeline model
    string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
        base_path,
        column
    )
    string_indexer_model.write().overwrite().save(string_indexer_output_path)
68. Data Syndrome: Agile Data Science 2.0
Vectorizing the Numeric Columns
Combining the numeric fields with VectorAssembler…
68
# Combine continuous, numeric fields with indexes of nominal ones
# ...into one feature vector
numeric_columns = [
"DepDelay", "Distance",
"DayOfMonth", "DayOfWeek",
"DayOfYear"]
index_columns = ["Carrier_index", "Origin_index",
"Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
inputCols=numeric_columns + index_columns,
outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Save the numeric vector assembler
vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
vector_assembler.write().overwrite().save(vector_assembler_path)
# Drop the index columns
for column in index_columns:
    final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
69. Data Syndrome: Agile Data Science 2.0
Training the Classifier Model
Creating and training a RandomForestClassifier model
69
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
maxMemoryInMB=1024
)
model = rfc.fit(final_vectorized_features)
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
70. Data Syndrome: Agile Data Science 2.0
Evaluating the Classifier Model
Using MulticlassClassificationEvaluator to check the accuracy of the model…
70
# Evaluate model using test data
predictions = model.transform(final_vectorized_features)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
predictionCol="Prediction",
labelCol="ArrDelayBucket",
metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print("Accuracy = {}".format(accuracy))
# Check the distribution of predictions
predictions.groupBy("Prediction").count().show()
# Check a sample
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
71. Data Syndrome: Agile Data Science 2.0
Running Main
Just what it looks like…
71
if __name__ == "__main__":
    main(sys.argv[1])
72. Data Syndrome: Agile Data Science 2.0 72
Using the model in realtime via Spark Streaming!
Deploying the Model
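The streaming snippets on the next slides refer to sc, spark, ssc, BROKERS and PREDICTION_TOPIC without showing where they come from. Below is a sketch of the kind of setup make_predictions_streaming.py assumes; the broker address, topic name and batch interval are illustrative values, not necessarily the book's exact ones.
# Sketch of the streaming job's environment; the values here are assumptions.
import json, iso8601
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext

APP_NAME = "make_predictions_streaming.py"
PERIOD = 10                                                # micro-batch interval in seconds
BROKERS = "localhost:9092"                                 # Kafka broker list
PREDICTION_TOPIC = "flight_delay_classification_request"   # assumed topic name
base_path = "."

sc = SparkContext(appName=APP_NAME)
ssc = StreamingContext(sc, PERIOD)   # used by KafkaUtils.createDirectStream below
spark = SparkSession(sc)             # used to create DataFrames inside foreachRDD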
73. Data Syndrome: Agile Data Science 2.0
Loading the Models
Loading the models we trained in batch to reproduce the data pipeline
73
# ch08/make_predictions_streaming.py
# Load the arrival delay bucketizer
from pyspark.ml.feature import Bucketizer
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer = Bucketizer.load(arrival_bucketizer_path)
# Load all the string field vectorizer pipelines into a dict
from pyspark.ml.feature import StringIndexerModel
string_indexer_models = {}
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
               "Origin", "Dest", "Route"]:
    string_indexer_model_path = "{}/models/string_indexer_model_{}.bin".format(
        base_path,
        column
    )
    string_indexer_model = StringIndexerModel.load(string_indexer_model_path)
    string_indexer_models[column] = string_indexer_model
74. Data Syndrome: Agile Data Science 2.0
Loading the Models
Loading the models we trained in batch to reproduce the data pipeline
74
# ch08/make_predictions_streaming.py
# Load the numeric vector assembler
from pyspark.ml.feature import VectorAssembler
vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
vector_assembler = VectorAssembler.load(vector_assembler_path)
# Load the classifier model
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
random_forest_model_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
    base_path
)
rfc = RandomForestClassificationModel.load(
    random_forest_model_path
)
75. Data Syndrome: Agile Data Science 2.0
Connecting to Kafka
Creating a direct stream to the Kafka queue containing our prediction requests
75
#
# Process Prediction Requests in Streaming
#
from pyspark.streaming.kafka import KafkaUtils
stream = KafkaUtils.createDirectStream(
ssc,
[PREDICTION_TOPIC],
{
"metadata.broker.list": BROKERS,
"group.id": "0",
}
)
object_stream = stream.map(lambda x: json.loads(x[1]))
object_stream.pprint()
76. Data Syndrome: Agile Data Science 2.0
Repeating the Pipeline
Running the prediction requests through the same data flow as the training data
76
row_stream = object_stream.map(
lambda x: Row(
FlightDate=iso8601.parse_date(x['FlightDate']),
Origin=x['Origin'],
Distance=x['Distance'],
DayOfMonth=x['DayOfMonth'],
DayOfYear=x['DayOfYear'],
UUID=x['UUID'],
DepDelay=x['DepDelay'],
DayOfWeek=x['DayOfWeek'],
FlightNum=x['FlightNum'],
Dest=x['Dest'],
Timestamp=iso8601.parse_date(x['Timestamp']),
Carrier=x['Carrier']
)
)
row_stream.pprint()
# Do the classification and store to Mongo
row_stream.foreachRDD(classify_prediction_requests)
ssc.start()
ssc.awaitTermination()
77. Data Syndrome: Agile Data Science 2.0
Repeating the Pipeline
Running the prediction requests through the same data flow as the training data
77
def classify_prediction_requests(rdd):
    from pyspark.sql.types import StringType, IntegerType, DoubleType, DateType, TimestampType
    from pyspark.sql.types import StructType, StructField

    prediction_request_schema = StructType([
        StructField("Carrier", StringType(), True),
        StructField("DayOfMonth", IntegerType(), True),
        StructField("DayOfWeek", IntegerType(), True),
        StructField("DayOfYear", IntegerType(), True),
        StructField("DepDelay", DoubleType(), True),
        StructField("Dest", StringType(), True),
        StructField("Distance", DoubleType(), True),
        StructField("FlightDate", DateType(), True),
        StructField("FlightNum", StringType(), True),
        StructField("Origin", StringType(), True),
        StructField("Timestamp", TimestampType(), True),
        StructField("UUID", StringType(), True),
    ])
    prediction_requests_df = spark.createDataFrame(rdd, schema=prediction_request_schema)
    prediction_requests_df.show()

    from pyspark.sql.functions import lit, concat
    prediction_requests_with_route = prediction_requests_df.withColumn(
        'Route',
        concat(
            prediction_requests_df.Origin,
            lit('-'),
            prediction_requests_df.Dest
        )
    )
    prediction_requests_with_route.show(6)
...
78. Data Syndrome: Agile Data Science 2.0
Repeating the Pipeline
Running the prediction requests through the same data flow as the training data
78
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
               "Origin", "Dest", "Route"]:
    string_indexer_model = string_indexer_models[column]
    prediction_requests_with_route = string_indexer_model.transform(prediction_requests_with_route)
# Vectorize numeric columns: DepDelay, Distance and index columns
final_vectorized_features = vector_assembler.transform(prediction_requests_with_route)
# Inspect the vectors
final_vectorized_features.show()
# Drop the individual index columns
index_columns = ["Carrier_index", "DayOfMonth_index", "DayOfWeek_index", "DayOfYear_index",
"Origin_index", "Dest_index", "Route_index"]
for column in index_columns:
    final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
# Make the prediction
predictions = rfc.transform(final_vectorized_features)
# Drop the features vector and prediction metadata to give the original fields
predictions = predictions.drop("Features_vec")
final_predictions = predictions.drop("indices").drop("values").drop("rawPrediction").drop("probability")
# Inspect the output
final_predictions.show()
79. Data Syndrome: Agile Data Science 2.0
Storing to Mongo
Putting the result where our web application can access it
79
# Store to Mongo
if final_predictions.count() > 0:
    final_predictions.rdd.map(lambda x: x.asDict()).saveToMongoDB(
        "mongodb://localhost:27017/agile_data_science.flight_delay_classification_response"
    )
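Note that saveToMongoDB is not a stock RDD method: it is added by the pymongo_spark package from the mongo-hadoop connector, which has to be imported and activated once when the job starts, roughly as sketched here (assuming pymongo_spark is installed and on the PYTHONPATH).
# Sketch: enable RDD.saveToMongoDB by activating pymongo_spark at startup
import pymongo_spark
pymongo_spark.activate()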
80. Data Syndrome: Agile Data Science 2.0 80
Experimental setup for iteratively improving the predictive model
Improving the Model
81. Data Syndrome: Agile Data Science 2.0
Experiment Setup
Necessary to improve model
81
82. Data Syndrome: Agile Data Science 2.0 82
155 additional lines to set up an experiment
and add 3 new features to improve the model
http://bit.ly/improved_model_spark
345 L.O.C.
#!/usr/bin/env python
import sys, os, re
import json
import datetime, iso8601
from tabulate import tabulate
# Pass date and base path to main() from airflow
def main(base_path):
APP_NAME = "train_spark_mllib_model.py"
# If there is no SparkSession, create the environment
try:
sc and spark
except NameError as e:
import findspark
findspark.init()
import pyspark
import pyspark.sql
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
#
# {
# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
# }
#
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField
from pyspark.sql.functions import udf
schema = StructType([
StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
StructField("Carrier", StringType(), True), # "Carrier":"WN"
StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
StructField("Dest", StringType(), True), # "Dest":"SAN"
StructField("Distance", DoubleType(), True), # "Distance":368.0
StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
StructField("Origin", StringType(), True), # "Origin":"TUS"
])
input_path = "{}/data/simple_flight_delay_features.json".format(
base_path
)
features = spark.read.json(input_path, schema=schema)
features.first()
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.show(6)
#
# Add the hour of day of scheduled arrival/departure
#
from pyspark.sql.functions import hour
features_with_hour = features_with_route.withColumn(
"CRSDepHourOfDay",
hour(features.CRSDepTime)
)
features_with_hour = features_with_hour.withColumn(
"CRSArrHourOfDay",
hour(features.CRSArrTime)
)
features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into early, on-time, slightly late and very late (0, 1, 2, 3)
#
from pyspark.ml.feature import Bucketizer
# Setup the Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
arrival_bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
# Save the model
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
# Apply the model
ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
#
# Extract features tools in with pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn category fields into indexes
for column in ["Carrier", "Origin", "Dest", "Route"]:
string_indexer = StringIndexer(
inputCol=column,
outputCol=column + "_index"
)
string_indexer_model = string_indexer.fit(ml_bucketized_features)
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
# Save the pipeline model
string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
base_path,
column
)
string_indexer_model.write().overwrite().save(string_indexer_output_path)
# Combine continuous, numeric fields with indexes of nominal ones
# ...into one feature vector
numeric_columns = [
"DepDelay", "Distance",
"DayOfMonth", "DayOfWeek",
"DayOfYear", "CRSDepHourOfDay",
"CRSArrHourOfDay"]
index_columns = ["Carrier_index", "Origin_index",
"Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
inputCols=numeric_columns + index_columns,
outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Save the numeric vector assembler
vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
vector_assembler.write().overwrite().save(vector_assembler_path)
# Drop the index columns
for column in index_columns:
final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
#
# Cross validate, train and evaluate classifier: loop 5 times for 4 metrics
#
from collections import defaultdict
scores = defaultdict(list)
feature_importances = defaultdict(list)
metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
split_count = 3
for i in range(1, split_count + 1):
print("nRun {} out of {} of test/train splits in cross validation...".format(
i,
split_count,
)
)
# Test/train split
training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
)
model = rfc.fit(training_data)
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
# Evaluate model using test data
predictions = model.transform(test_data)
# Evaluate this split's results for each metric
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
for metric_name in metric_names:
evaluator = MulticlassClassificationEvaluator(
labelCol="ArrDelayBucket",
predictionCol="Prediction",
metricName=metric_name
)
score = evaluator.evaluate(predictions)
scores[metric_name].append(score)
print("{} = {}".format(metric_name, score))
#
# Collect feature importances
#
feature_names = vector_assembler.getInputCols()
feature_importance_list = model.featureImportances
for feature_name, feature_importance in zip(feature_names, feature_importance_list):
feature_importances[feature_name].append(feature_importance)
#
# Evaluate average and STD of each metric and print a table
#
import numpy as np
score_averages = defaultdict(float)
# Compute the table data
average_stds = [] # ha
for metric_name in metric_names:
metric_scores = scores[metric_name]
average_accuracy = sum(metric_scores) / len(metric_scores)
score_averages[metric_name] = average_accuracy
std_accuracy = np.std(metric_scores)
average_stds.append((metric_name, average_accuracy, std_accuracy))
# Print the table
print("nExperiment Log")
print("--------------")
print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))
#
# Persist the score to a score log that exists between runs
#
import pickle
# Load the score log or initialize an empty one
try:
score_log_filename = "{}/models/score_log.pickle".format(base_path)
score_log = pickle.load(open(score_log_filename, "rb"))
if not isinstance(score_log, list):
score_log = []
except IOError:
score_log = []
# Compute the existing score log entry
score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
# Compute and display the change in score for each metric
try:
last_log = score_log[-1]
except (IndexError, TypeError, AttributeError):
last_log = score_log_entry
experiment_report = []
for metric_name in metric_names:
run_delta = score_log_entry[metric_name] - last_log[metric_name]
experiment_report.append((metric_name, run_delta))
print("nExperiment Report")
print("-----------------")
print(tabulate(experiment_report, headers=["Metric", "Score"]))
# Append the existing average scores to the log
score_log.append(score_log_entry)
# Persist the log for next run
pickle.dump(score_log, open(score_log_filename, "wb"))
#
# Analyze and report feature importance changes
#
# Compute averages for each feature
feature_importance_entry = defaultdict(float)
for feature_name, value_list in feature_importances.items():
average_importance = sum(value_list) / len(value_list)
feature_importance_entry[feature_name] = average_importance
# Sort the feature importances in descending order and print
import operator
sorted_feature_importances = sorted(
feature_importance_entry.items(),
key=operator.itemgetter(1),
reverse=True
)
print("nFeature Importances")
print("-------------------")
print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
#
# Compare this run's feature importances with the previous run's
#
# Load the feature importance log or initialize an empty one
try:
feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
feature_log = pickle.load(open(feature_log_filename, "rb"))
if not isinstance(feature_log, list):
feature_log = []
except IOError:
feature_log = []
# Compute and display the change in score for each feature
try:
last_feature_log = feature_log[-1]
except (IndexError, TypeError, AttributeError):
last_feature_log = defaultdict(float)
for feature_name, importance in feature_importance_entry.items():
last_feature_log[feature_name] = importance
# Compute the deltas
feature_deltas = {}
for feature_name in feature_importances.keys():
run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
feature_deltas[feature_name] = run_delta
# Sort feature deltas, biggest change first
import operator
sorted_feature_deltas = sorted(
feature_deltas.items(),
key=operator.itemgetter(1),
reverse=True
)
# Display sorted feature deltas
print("nFeature Importance Delta Report")
print("-------------------------------")
print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))
# Append the existing average deltas to the log
feature_log.append(feature_importance_entry)
# Persist the log for next run
pickle.dump(feature_log, open(feature_log_filename, "wb"))
if __name__ == "__main__":
main(sys.argv[1])
83. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
83
from collections import defaultdict
scores = defaultdict(list)
feature_importances = defaultdict(list)
metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
split_count = 3

for i in range(1, split_count + 1):
    print("\nRun {} out of {} of test/train splits in cross validation...".format(
        i,
        split_count,
    ))

    # Test/train split
    training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])

    # Instantiate and fit random forest classifier on all the data
    from pyspark.ml.classification import RandomForestClassifier
    rfc = RandomForestClassifier(
        featuresCol="Features_vec",
        labelCol="ArrDelayBucket",
        predictionCol="Prediction",
        maxBins=4657,
    )
    model = rfc.fit(training_data)
84. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
84
    # Save the new model over the old one
    model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
        base_path
    )
    model.write().overwrite().save(model_output_path)

    # Evaluate model using test data
    predictions = model.transform(test_data)

    # Evaluate this split's results for each metric
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    for metric_name in metric_names:
        evaluator = MulticlassClassificationEvaluator(
            labelCol="ArrDelayBucket",
            predictionCol="Prediction",
            metricName=metric_name
        )
        score = evaluator.evaluate(predictions)
        scores[metric_name].append(score)
        print("{} = {}".format(metric_name, score))
85. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
85
    #
    # Collect feature importances
    #
    feature_names = vector_assembler.getInputCols()
    feature_importance_list = model.featureImportances
    for feature_name, feature_importance in zip(feature_names, feature_importance_list):
        feature_importances[feature_name].append(feature_importance)

#
# Evaluate average and STD of each metric and print a table
#
import numpy as np
score_averages = defaultdict(float)

# Compute the table data
average_stds = []  # ha
for metric_name in metric_names:
    metric_scores = scores[metric_name]
    average_accuracy = sum(metric_scores) / len(metric_scores)
    score_averages[metric_name] = average_accuracy

    std_accuracy = np.std(metric_scores)
    average_stds.append((metric_name, average_accuracy, std_accuracy))
86. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
86
# Print the table
print("\nExperiment Log")
print("--------------")
print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))

#
# Persist the score to a score log that exists between runs
#
import pickle

# Load the score log or initialize an empty one
try:
    score_log_filename = "{}/models/score_log.pickle".format(base_path)
    score_log = pickle.load(open(score_log_filename, "rb"))
    if not isinstance(score_log, list):
        score_log = []
except IOError:
    score_log = []

# Compute the existing score log entry
score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
87. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
87
# Compute and display the change in score for each metric
try:
    last_log = score_log[-1]
except (IndexError, TypeError, AttributeError):
    last_log = score_log_entry

experiment_report = []
for metric_name in metric_names:
    run_delta = score_log_entry[metric_name] - last_log[metric_name]
    experiment_report.append((metric_name, run_delta))

print("\nExperiment Report")
print("-----------------")
print(tabulate(experiment_report, headers=["Metric", "Score"]))

# Append the existing average scores to the log
score_log.append(score_log_entry)

# Persist the log for next run
pickle.dump(score_log, open(score_log_filename, "wb"))
88. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
88
#
# Analyze and report feature importance changes
#
# Compute averages for each feature
feature_importance_entry = defaultdict(float)
for feature_name, value_list in feature_importances.items():
    average_importance = sum(value_list) / len(value_list)
    feature_importance_entry[feature_name] = average_importance

# Sort the feature importances in descending order and print
import operator
sorted_feature_importances = sorted(
    feature_importance_entry.items(),
    key=operator.itemgetter(1),
    reverse=True
)
print("\nFeature Importances")
print("-------------------")
print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
89. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
89
#
# Compare this run's feature importances with the previous run's
#
# Load the feature importance log or initialize an empty one
try:
    feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
    feature_log = pickle.load(open(feature_log_filename, "rb"))
    if not isinstance(feature_log, list):
        feature_log = []
except IOError:
    feature_log = []

# Compute and display the change in score for each feature
try:
    last_feature_log = feature_log[-1]
except (IndexError, TypeError, AttributeError):
    last_feature_log = defaultdict(float)
    for feature_name, importance in feature_importance_entry.items():
        last_feature_log[feature_name] = importance
90. Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
90
# Compute the deltas
feature_deltas = {}
for feature_name in feature_importances.keys():
    run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
    feature_deltas[feature_name] = run_delta

# Sort feature deltas, biggest change first
import operator
sorted_feature_deltas = sorted(
    feature_deltas.items(),
    key=operator.itemgetter(1),
    reverse=True
)

# Display sorted feature deltas
print("\nFeature Importance Delta Report")
print("-------------------------------")
print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))

# Append the existing average deltas to the log
feature_log.append(feature_importance_entry)

# Persist the log for next run
pickle.dump(feature_log, open(feature_log_filename, "wb"))
91. Data Syndrome: Agile Data Science 2.0
Running an Experiment
Cross validate, train and evaluate classifier
91
Experiment Log
--------------
Metric Average STD
----------------- --------- -----------
accuracy 0.594443 0.000382382
weightedPrecision 0.642419 0.00352101
weightedRecall 0.594443 0.000382382
f1 0.522397 0.000438121
92. Data Syndrome: Agile Data Science 2.0
Comparing Experiments
Cross validate, train and evaluate classifier
92
Experiment Report
-----------------
Metric Score
----------------- -----------
accuracy 0.00300548
weightedPrecision -0.00592227
weightedRecall 0.00300548
f1 -0.0105553
95. Data Syndrome: Agile Data Science 2.0 95
Next steps for learning more about Agile Data Science 2.0
Next Steps
96. Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Available Now on O’Reilly Safari: http://bit.ly/agile_data_safari
Agile Data Science 2.0
97. Agile Data Science 2.0 97
Realtime Predictive
Analytics
Rapidly learn to build entire predictive systems driven by Kafka, PySpark, Spark
Streaming, Spark MLlib and a web front-end using Python/Flask and jQuery.
Available for purchase at http://datasyndrome.com/video
98. Data Syndrome Russell Jurney
Principal Consultant
Email : rjurney@datasyndrome.com
Web : datasyndrome.com
Data Syndrome, LLC
Product Consulting
We build analytics products and systems consisting of big data viz, predictions, recommendations, reports and search.
Corporate Training
We offer training courses for data scientists, data engineers and data science teams.
Video Training
We offer video training courses that rapidly acclimate you to a technology and technique.