Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is one example meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, Elasticsearch, Apache Parquet, Python/Flask, jQuery. This talk will cover the full lifecycle of large-scale data application development and will show how to use lessons from agile software engineering to apply data science with this full stack to build better analytics applications. The system starts with plumbing, moves on to data tables, charts, and search, continues through interactive reports, and builds towards predictions in both batch and real time (defining the role of each), covering the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
Agile Data Science 2.0 covers the theory and practice of applying agile methods to data science, the practice of applied analytics research. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" to enable agility by building applications that describe the applied research process itself. Then we show how to use 'big data' tools to iteratively build, deploy, and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to take applications from conception through development and deployment to iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, you can effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, Elasticsearch and Apache Airflow.
Slides for the Data Syndrome one-hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
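The histogram demo boils down to bucketing values on the cluster and plotting the counts locally. As a rough pure-Python sketch (no Spark required) of the evenly spaced bucketing that an operation like PySpark's `RDD.histogram(numBuckets)` performs:

```python
def histogram(values, num_buckets):
    """Evenly spaced buckets between min and max, with counts per bucket
    (assumes max > min; the last bucket is closed on the right)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    edges = [lo + i * width for i in range(num_buckets + 1)]
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return edges, counts

edges, counts = histogram([1, 2, 2, 3, 8, 9], 2)
```

In the course itself the counts would be computed by Spark across the cluster and only the small `(edges, counts)` result handed to pylab for plotting.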
Networks All Around Us: Extracting networks from your problem domain, by Russell Jurney
Network analytics is increasingly used to create machine intelligence that automates the world around us. But what is a network, and how do you analyze one? More directly: how do I find and analyze networks in my own dataset? This talk goes over a number of examples of practical network analytics to give viewers a playbook for applied social network analysis and network analytics.
Running Intelligent Applications inside a Database: Deep Learning with Python..., by Miguel González-Fierro
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must pull data out of the database to execute a routine; if the data is large, moving it is inefficient. Stored procedures tried to solve this issue in the past by allowing simple functions to be computed inside the database, but only simple routines could be executed.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft’s Cognitive Toolkit, also known as CNTK. We used transfer learning between the ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a Convolutional Neural Network pretrained on ImageNet is applied to the lung cancer dataset to generate features. Once the features are computed, a boosted tree is used to predict whether or not the patient has cancer.
This entire process is computed inside the database, so data movement is minimized. We are even able to execute the algorithm using the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than one hour, in contrast to a CPU, which would take up to 32 hours. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction for a patient.
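As a toy illustration of the featurize-then-boost pattern described above (not Microsoft's actual CNTK pipeline), the sketch below uses a frozen random projection as a stand-in for the pretrained CNN and a single decision stump as a stand-in for the boosted tree; all names and data here are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained CNN: a frozen projection mapping raw
# pixels to a compact feature vector. Its weights are never retrained,
# which is the essence of transfer learning.
W_frozen = rng.normal(size=(64, 8))

def featurize(images):
    # images: (n_samples, 64) flattened scans -> (n_samples, 8) features
    return np.maximum(images @ W_frozen, 0.0)

# Stand-in for the boosted tree: a single decision stump fit on the
# extracted features (a real system would use gradient boosting).
def fit_stump(X, y):
    best = (-1.0, 0, 0.0, True)
    for j in range(X.shape[1]):
        t = X[:, j].mean()
        for polarity in (True, False):
            acc = (((X[:, j] > t) == polarity) == y).mean()
            if acc > best[0]:
                best = (acc, j, t, polarity)
    return best[1:]

def predict(X, stump):
    j, t, polarity = stump
    return (X[:, j] > t) == polarity

# Tiny synthetic example: "sick" scans have higher mean intensity.
healthy = rng.normal(0.0, 1.0, size=(20, 64))
sick = rng.normal(2.0, 1.0, size=(20, 64))
images = np.vstack([healthy, sick])
labels = np.array([False] * 20 + [True] * 20)

features = featurize(images)
stump = fit_stump(features, labels)
train_acc = (predict(features, stump) == labels).mean()
```

The key design point carried over from the talk: only the small classifier at the end is trained on the scarce medical data; the expensive feature extractor is reused as-is.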
DF1 - R - Natekin - Improving Daily Analysis with data.table, by MoscowDataFest
Presentation from Moscow Data Fest #1, September 12.
Moscow Data Fest is a free one-day event that brings together Data Scientists for sessions on both theory and practice.
Link: http://www.meetup.com/Moscow-Data-Fest/
Data science with Windows Azure - A Brief Introduction, by Adnan Masood
Data Science with Windows Azure is an introduction to HDInsight and the Hadoop offerings from Microsoft's cloud-based machine learning and big data platform. This was presented at the Microsoft Data Science Group – Tampa Analytics Professionals.
Applied Machine Learning using H2O, Python and R Workshop, by Avkash Chauhan
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Prerequisites: basic knowledge of R/Python and general ML concepts
Note: This is a bring-your-own-laptop workshop. Make sure you bring your laptop in order to participate in the workshop
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding the linear regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
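As a taste of the unsupervised-learning segment of the agenda, here is a minimal pure-Python sketch of the k-means procedure (H2O's actual implementation is distributed and far more robust; the points below are invented):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Plain k-means: assign each point to its nearest centre, then
    move each centre to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centres[c])))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster empties out
                centres[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centres

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centres = kmeans(pts, 2)
```

With two well-separated groups as above, the centres converge to the two group means regardless of which points are sampled as the initial centres.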
Fishing Graphs in a Hadoop Data Lake, by Jörg Schad and Max Neunhoeffer at Big Data Spain
Hadoop clusters can store nearly everything in your data lake cheaply and make it available blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses.
https://www.bigdataspain.org/2017/talk/fishing-graphs-in-a-hadoop-data-lake
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
Hadoop clusters can store nearly everything in your data lake cheaply and make it available blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses. Increasingly, data has a natural structure as a graph, with vertices linked by edges, and many questions about the data involve graph traversals or other complex queries for which one has no a priori bound on the length of paths.
Enabling Multimodel Graphs with Apache TinkerPop, by Jason Plurad
Graphs are everywhere, but in a modern data stack, they are not the only tool in the toolbox. With Apache TinkerPop, adding graph capability on top of your existing data platform is not as daunting as it sounds. We will do a deep dive on writing Traversal Strategies to optimize performance of the underlying graph database. We will investigate how various TinkerPop systems offer unique possibilities in a multimodel approach to graph processing. We will discuss how using Gremlin frees you from vendor lock-in and enables you to swap out your graph database as your requirements evolve. Presented at Graph Day Texas, January 14, 2017. http://graphday.com/graph-day-at-data-day-texas/#plurad
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `spaCy` for NLP analysis of texts, including full parses, named entity extraction, etc. It also produces auto-summarization of texts, using an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
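The core of TextRank can be sketched in a few lines of plain Python: build a word co-occurrence graph, then run PageRank over it to rank terms. This is an illustration of the idea only, not PyTextRank's API (which works on parsed lemmas rather than raw tokens):

```python
def textrank(words, window=2, damping=0.85, iters=50):
    """Rank words by PageRank over an undirected co-occurrence graph
    built from a sliding window over the token stream."""
    neighbours = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbours[w].add(words[j])
                neighbours[words[j]].add(w)
    rank = dict.fromkeys(neighbours, 1.0)
    for _ in range(iters):
        rank = {
            w: (1 - damping) + damping * sum(
                rank[u] / len(neighbours[u]) for u in nbrs)
            for w, nbrs in neighbours.items()
        }
    return sorted(rank, key=rank.get, reverse=True)

tokens = "spark streaming uses spark sql and spark mllib".split()
ranked = textrank(tokens)
```

The most connected term rises to the top, which is exactly why keyphrases extracted this way beat raw frequency-based keyword lists.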
An introduction to a blog about destinations, arts, culture, people, and cuisines: everything you would want to know about Kerala.
Discover Life. Feel Divinity. Find Yourself... Experience God's Own Country.
An overview of Teraproc cluster-as-a-service offerings for high-performance distributed analytics. This overview presentation includes a step-by-step demonstration of the process of deploying a ready-to-run R Studio cluster environment on Amazon Web Services. More information available at http://teraproc.com
Top Insights from SaaStr by Leading Enterprise Software Experts, from OpenView
Market Research
SHARE
I had the pleasure of attending the SaaStr Annual 2016 Conference in San Francisco earlier this month and wanted to share some of the insights I gathered from that event with you here. The findings below are arranged by functional area with attribution. I tried to compress the content as much as possible, but there was A TON of great information at the conference, so I would highly recommend spending the time to read through it.
A detailed report on IBM's 30TB Hadoop-DS benchmark, showing that IBM InfoSphere BigInsights (SQL-on-Hadoop) is able to execute all 99 TPC-DS queries at scale over native Hadoop data formats. Written by Simon Harris, Abhayan Sundararajan, John Poelman and Matthew Emmerton.
Best Practices for Building and Deploying Data Pipelines in Apache Spark, by Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
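One of the considerations named above, idempotency, can be illustrated with a minimal write-then-atomic-rename sketch. This is a general pattern, not Waimak's API; the file name is a made-up example:

```python
import os
import tempfile

def idempotent_write(path, data: bytes):
    """Write to a temp file, then atomically rename it into place.
    Re-running the step yields the same final file, and a crash
    mid-write never exposes a partially written output."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic when src and dst share a filesystem
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)

target = os.path.join(tempfile.mkdtemp(), "report.out")
idempotent_write(target, b"rows")
idempotent_write(target, b"rows")  # re-running the pipeline step is safe
```

Separating this kind of plumbing from business logic is precisely what lets non-engineers define pipelines without worrying about production concerns.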
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio, by Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data in separate storage, such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without being bottlenecked by network I/O.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Founding committer of Spark, Patrick Wendell, gave this talk at 2015 Strata London about Apache Spark.
These slides provide an introduction to Spark, and delve into future developments, including DataFrames, the Datasource API, the Catalyst logical optimizer, and Project Tungsten.
The primary focus of this presentation is approaching the migration of a large, legacy data store into a new schema built with Django. Includes discussion of how to structure a migration script so that it will run efficiently and scale. Learn how to recognize and evaluate trouble spots.
Also discusses some general tips and tricks for working with data and establishing a productive workflow.
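A common pattern for making such a migration script efficient, scalable, and restartable is to walk the legacy table in primary-key batches instead of loading everything at once. A framework-agnostic sketch (the function and numbers are illustrative, not from the talk):

```python
def batched_ids(min_id, max_id, batch_size):
    """Yield (start, end) primary-key ranges so a huge legacy table can
    be migrated in constant memory and resumed mid-way after a failure:
    for each range, fetch those rows, transform them, and insert them
    into the new schema."""
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1

ranges = list(batched_ids(1, 2500, 1000))
```

In a Django script each `(start, end)` range would typically become a `filter(pk__range=(start, end))` queryset, keeping memory flat no matter how large the source table is.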
Intershop Commerce Management with Microsoft SQL Server, by Mauro Boffardi
Presentation during inPulse 2018 event, by Jens Kleinschmidt (Technical Product Manager / Architect of Intershop) , describing the choice of offering Microsoft SQL Server with the new 7.10 version of Commerce Suite.
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading, by Paco Nathan
Presentation to the Boulder/Denver BigData meetup 2013-09-25 http://www.meetup.com/Boulder-Denver-Big-Data/events/131047972/
Overview of Enterprise Data Workflows with Cascading; code samples in Cascading, Cascalog, Scalding; Lingual and Pattern Examples; An Evolution of Cluster Computing based on Apache Mesos, with use cases
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming, by Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
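The micro-batch model can be illustrated in a few lines of plain Python. This is a conceptual sketch only, not Spark Streaming's API; the events and interval are invented:

```python
def microbatches(events, interval=0.5):
    """Illustration of the discretized-stream (D-Stream) idea: chop a
    timestamped event stream into fixed intervals, then apply the same
    batch logic (here, a simple count) to each interval."""
    batches = {}
    for t, value in events:
        batches.setdefault(int(t / interval), []).append(value)
    return {k: len(v) for k, v in sorted(batches.items())}

events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
counts = microbatches(events)
```

Because each interval is processed with ordinary batch code, the same function can serve interactive and streaming workloads alike, which is the unified-engine point above.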
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Jump Start into Apache® Spark™ and Databricks, by Databricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Science and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites, such as AOL, Yahoo, Buy.com, CNET, CitySearch, Netflix, Zappos, StubHub, Digg, eTrade, Disney, Apple, NASA and MTV.
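As a small illustration of the faceted full-text queries Solr serves, here is a sketch that assembles a request URL for Solr's `/select` handler using standard parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The host, core, and field names are made-up examples, and no live server is contacted:

```python
from urllib.parse import urlencode

def solr_query_url(base, q, facet_field=None, rows=10):
    """Assemble a Solr /select URL for a faceted full-text search."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base + "/select?" + urlencode(params)

url = solr_query_url("http://localhost:8983/solr/products",
                     "title:laptop", facet_field="brand")
```

Sending this URL to a running Solr instance would return matching documents plus per-brand counts, the building block behind the faceted navigation mentioned above.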
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan... (Application areas in research on advanced statistical technologies), by Jürgen Ambrosi
In this session we will see, with the usual hands-on demo approach, how to use the R language to perform value-added analyses.
We will experience first-hand the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals.
In this session we will have the participation of Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
What’s new in Spark 2.0?
Rerngvit Yanggratoke @ Combient AB
Örjan Lundberg @ Combient AB
Machine Learning Stockholm Meetup
27 October, 2016
Schibsted Media Group
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues attendees had raised.
For more Tendenci AMS events, check out www.tendenci.com/events
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient..., by Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Understanding Globus Data Transfers with NetSage, by Globus
NetSage is an open, privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks worldwide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to that of Globus users?
Globus Connect Server Deep Dive - GlobusWorld 2024, by Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP web services.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR, by Tier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears to be one single error, underneath there are 9 types of OutOfMemoryError. Each type has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
1. Building Full Stack Data Analytics Applications with Kafka and Spark
Agile Data Science 2.0
2. Agile Data Science 2.0
Russell Jurney
Principal Consultant, Data Syndrome, LLC
Email: russell.jurney@gmail.com
Web: datasyndrome.com

Skills: Data Engineer (85%), Data Scientist (85%), Visualization Software Engineer (85%), Writer (85%), Teacher (50%)

Russell Jurney is a veteran data scientist and thought leader. He coined the term Agile Data Science in the book of that name from O'Reilly in 2012, which outlines the first agile development methodology for data science. Russell has constructed numerous full-stack analytics products over the past ten years and now works with clients helping them extract value from their data assets.
3. Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Agile Data Science 2.0
4. Agile Data Science 2.0 4
Realtime Predictive Analytics
Rapidly learn to build entire predictive systems driven by Kafka, PySpark, Spark Streaming, and Spark MLlib, with a web front end using Python/Flask and jQuery.
Available for purchase at http://datasyndrome.com/video
5.
Product Consulting
We build analytics products and systems
consisting of big data viz, predictions,
recommendations, reports and search.
Corporate Training
We offer training courses for data scientists, engineers, and data science teams.
Video Training
We offer video training courses that rapidly
acclimate you with a technology and
technique.
6. Agile Data Science 2.0 6
What makes data science “agile data science”?
Theory
7. Agile Data Science 2.0 7
Yes. Building applications is a fundamental skill for today’s data scientist.
Data Products or Data Science?
8. Agile Data Science 2.0 8
If someone else has to start over and rebuild it, it ain’t agile.
Big Data or Data Science?
9. Agile Data Science 2.0 9
Goal of
Methodology
The goal of agile data science in <140 characters: to
document and guide exploratory data analysis to
discover and follow the critical path to a compelling
product.
10. Agile Data Science 2.0 10
In analytics, the end-goal moves or is complex in nature, so we model as a
network of tasks rather than as a strictly linear process.
Critical Path
11. Agile Data Science 2.0
Agile Data Science Manifesto
11
Seven Principles for Agile Data Science
1. Iterate, iterate, iterate: tables, charts, reports, predictions.
2. Ship intermediate output. Even failed experiments have output.
3. Prototype experiments over implementing tasks.
4. Integrate the tyrannical opinion of data in product management.
5. Climb up and down the data-value pyramid as we work.
6. Discover and pursue the critical path to a killer product.
7. Get meta. Describe the process, not just the end-state.
12. Agile Data Science 2.0 12
People will pay more for the things toward the top, but you need the things on the bottom to have the things above. They are foundational. See: Maslow's hierarchy of needs.
Data Value Pyramid
14. Agile Data Science 2.0
Agile Data Science 2.0 Stack
14
Apache Spark: Batch and Realtime
Apache Kafka: Realtime Queue
MongoDB: Document Store
Flask: Simple Web App
Apache Airflow (incubating): Batch Deployment
Example of a high productivity stack for “big” data applications
15. Agile Data Science 2.0
Flow of Data Processing
15
Tools and processes in collecting, refining, publishing and decorating data
{“hello”: “world”}
17. Agile Data Science 2.0 17
SQL or dataflow programming?
Programming Models
18. Agile Data Science 2.0 18
Describing what you want and letting the planner figure out how
SQL
SELECT associations2.object_id,
associations2.term_id, associations2.cat_ID,
associations2.term_taxonomy_id
FROM (SELECT objects_tags.object_id,
objects_tags.term_id, wp_cb_tags2cats.cat_ID,
categories.term_taxonomy_id
FROM (SELECT
wp_term_relationships.object_id,
wp_term_taxonomy.term_id,
wp_term_taxonomy.term_taxonomy_id
FROM wp_term_relationships
LEFT JOIN wp_term_taxonomy ON
wp_term_relationships.term_taxonomy_id =
wp_term_taxonomy.term_taxonomy_id
ORDER BY object_id ASC, term_id ASC)
AS objects_tags
LEFT JOIN wp_cb_tags2cats ON
objects_tags.term_id = wp_cb_tags2cats.tag_ID
LEFT JOIN (SELECT
wp_term_relationships.object_id,
wp_term_taxonomy.term_id as cat_ID,
wp_term_taxonomy.term_taxonomy_id
FROM wp_term_relationships
LEFT JOIN wp_term_taxonomy ON
wp_term_relationships.term_taxonomy_id =
wp_term_taxonomy.term_taxonomy_id
WHERE wp_term_taxonomy.taxonomy =
'category'
GROUP BY object_id, cat_ID,
term_taxonomy_id
ORDER BY object_id, cat_ID,
term_taxonomy_id)
AS categories on wp_cb_tags2cats.cat_ID
= categories.term_id
WHERE objects_tags.term_id =
wp_cb_tags2cats.tag_ID
GROUP BY object_id, term_id, cat_ID,
term_taxonomy_id
ORDER BY object_id ASC, term_id ASC, cat_ID
ASC)
AS associations2
LEFT JOIN categories ON associations2.object_id
= categories.object_id
WHERE associations2.cat_ID <> categories.cat_ID
GROUP BY object_id, term_id, cat_ID,
term_taxonomy_id
ORDER BY object_id, term_id, cat_ID,
term_taxonomy_id
19. Agile Data Science 2.0 19
Flowing data through operations to effect change
Dataflow Programming
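The contrast with the sprawling SQL on the previous slide can be seen even without a cluster. Below is a pure-Python sketch of the dataflow shape; the toy records are hypothetical, and PySpark applies this same filter/map/aggregate chain to distributed data:

```python
# A minimal, pure-Python sketch of the dataflow style: records flow through
# a chain of operations (filter -> map -> aggregate). The records below are
# made up for illustration.
records = [
    {"FlightNum": "1", "ArrDelayMinutes": 25.0},
    {"FlightNum": "2", "ArrDelayMinutes": 0.0},
    {"FlightNum": "3", "ArrDelayMinutes": 5.0},
]

# Filter: keep flights that arrived late
late = [r for r in records if r["ArrDelayMinutes"] > 0]

# Map: project out just the delay values
delays = [r["ArrDelayMinutes"] for r in late]

# Aggregate: average delay among late flights
avg_delay = sum(delays) / len(delays)
```

Each step names a transformation on the whole dataset rather than a plan for the query optimizer, which is why the style composes so naturally in PySpark.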
20. Agile Data Science 2.0 20
The best of both worlds!
SQL AND
Dataflow Programming
# Flights that were late arriving...
late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
total_late_arrivals = late_arrivals.count()

# Flights that left late but made up time to arrive on time...
on_time_heros = on_time_dataframe.filter(
  (on_time_dataframe.DepDelayMinutes > 0)
  &
  (on_time_dataframe.ArrDelayMinutes <= 0)
)
total_on_time_heros = on_time_heros.count()

# Get the percentage of flights that are late, rounded to 1 decimal place
# (total_flights and total_late_departures are computed in earlier steps)
pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)

print("Total flights: {:,}".format(total_flights))
print("Late departures: {:,}".format(total_late_departures))
print("Late arrivals: {:,}".format(total_late_arrivals))
print("Recoveries: {:,}".format(total_on_time_heros))
print("Percentage Late: {}%".format(pct_late))

# Why are flights late? Let's look at some delayed flights and the delay causes
late_flights = spark.sql("""
SELECT
  ArrDelayMinutes, WeatherDelay, CarrierDelay,
  NASDelay, SecurityDelay, LateAircraftDelay
FROM on_time_performance
WHERE
  WeatherDelay IS NOT NULL
  OR CarrierDelay IS NOT NULL
  OR NASDelay IS NOT NULL
  OR SecurityDelay IS NOT NULL
  OR LateAircraftDelay IS NOT NULL
ORDER BY FlightDate
""")
late_flights.sample(False, 0.01).show()

# Calculate the percentage contribution to delay for each source
total_delays = spark.sql("""
SELECT
  ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,
  ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,
  ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
  ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,
  ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay
FROM on_time_performance
""")
total_delays.show()

# Generate a histogram of the weather and carrier delays
weather_delay_histogram = on_time_dataframe \
  .select("WeatherDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram(10)
print("{}\n{}".format(weather_delay_histogram[0], weather_delay_histogram[1]))

# Eyeball the first to define our buckets
weather_delay_histogram = on_time_dataframe \
  .select("WeatherDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram([1, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)

# Transform the data into something easily consumed by d3
record = {'key': 1, 'data': []}
for label, count in zip(weather_delay_histogram[0], weather_delay_histogram[1]):
  record['data'].append(
    {
      'label': label,
      'count': count
    }
  )

# Save to Mongo directly, since this is a dict, not a DataFrame or RDD
from pymongo import MongoClient
client = MongoClient()
client.relato.weather_delay_histogram.insert_one(record)
22. Data Syndrome: Agile Data Science 2.0
Collect and Serialize Events in JSON
I never regret using JSON
22
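Serializing each event as one JSON object per line ("JSON Lines") keeps collected logs splittable and trivially readable back into Spark. A minimal sketch with hypothetical events:

```python
import json

# Hypothetical flight events, serialized one JSON object per line.
events = [
    {"Carrier": "WN", "FlightDate": "2015-12-31", "ArrDelay": 5.0},
    {"Carrier": "AA", "FlightDate": "2015-12-31", "ArrDelay": -3.0},
]

# Serialize: newline-delimited JSON, with stable key order
jsonl = "\n".join(json.dumps(e, sort_keys=True) for e in events)

# Round-trip: each line parses back to the original event
parsed = [json.loads(line) for line in jsonl.splitlines()]
```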
23. Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
23
http://www.transtats.bts.gov/Fields.asp?table_id=236
24. Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
24
"Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum",
"OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips",
"OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState",
"DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups",
"DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15",
"ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime",
"Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay",
"FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay",
"DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime",
"Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime",
"Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn",
"Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID",
"Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID",
"Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum"
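Before the records reach Parquet and Spark, the raw feed is just CSV with the header above. A small stdlib sketch of schema-aware parsing, using a made-up two-row sample with a handful of the 100+ columns:

```python
import csv
import io

# A tiny sample in the on-time performance layout (only a few of the
# real columns, with made-up rows) to show typed CSV parsing.
sample = io.StringIO(
    '"Year","Month","Carrier","ArrDelayMinutes"\n'
    '2015,1,"WN",25.0\n'
    '2015,1,"AA",0.0\n'
)

reader = csv.DictReader(sample)
# Coerce the numeric delay field; everything else stays a string
rows = [
    {**row, "ArrDelayMinutes": float(row["ArrDelayMinutes"])}
    for row in reader
]
```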
25. Data Syndrome: Agile Data Science 2.0
openflights.org Database
Airports, Airlines, Routes
25
26. Data Syndrome: Agile Data Science 2.0
Scraping the FAA Registry
Airplane Data by Tail Number
26
27. Data Syndrome: Agile Data Science 2.0
Wikipedia Airlines Entries
Descriptions of Airlines
27
28. Data Syndrome: Agile Data Science 2.0
National Centers for Environmental Information
Historical Weather Observations
28
29. Agile Data Science 2.0 29
Working our way up the data value pyramid
Climbing the Stack
30. Agile Data Science 2.0 30
Starting by “plumbing” the system from end to end
Plumbing
31. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records
Plumbing our master records through to the web
31
32. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to MongoDB
Plumbing our master records through to the web
32
import pymongo
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Convert to RDD of dicts and save to MongoDB
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
33. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to ElasticSearch
Plumbing our master records through to the web
33
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Save the DataFrame to Elasticsearch
on_time_dataframe.write.format("org.elasticsearch.spark.sql") \
  .option("es.resource", "agile_data_science/on_time_performance") \
  .option("es.batch.size.entries", "100") \
  .mode("overwrite") \
  .save()
34. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
34
from flask import Flask, render_template, request
from pymongo import MongoClient
from bson import json_util
# Set up Flask and Mongo
app = Flask(__name__)
client = MongoClient()
# Controller: Fetch a flight record and display it
@app.route("/on_time_performance")
def on_time_performance():
carrier = request.args.get('Carrier')
flight_date = request.args.get('FlightDate')
flight_num = request.args.get('FlightNum')
flight = client.agile_data_science.on_time_performance.find_one({
'Carrier': carrier,
'FlightDate': flight_date,
'FlightNum': int(flight_num)
})
return json_util.dumps(flight)
if __name__ == "__main__":
app.run(debug=True)
35. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
35
36. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
36
38. Data Syndrome: Agile Data Science 2.0
Tables in PySpark
Back end development in PySpark
38
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Use SQL to look at the total flights by month across 2015
on_time_dataframe.registerTempTable("on_time_dataframe")
total_flights_by_month = spark.sql(
"""SELECT Month, Year, COUNT(*) AS total_flights
FROM on_time_dataframe
GROUP BY Year, Month
ORDER BY Year, Month"""
)
# This rdd/asDict trick makes the rows print a little prettier, and yields an RDD for MongoDB. It is optional.
flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
flights_chart_data.collect()
# Save chart to MongoDB
import pymongo_spark
pymongo_spark.activate()
flights_chart_data.saveToMongoDB(
'mongodb://localhost:27017/agile_data_science.flights_by_month'
)
39. Data Syndrome: Agile Data Science 2.0
Tables in Flask and Jinja2
Front end development in Flask: controller and template
39
# Controller: Fetch a flight table
@app.route("/total_flights")
def total_flights():
total_flights = client.agile_data_science.flights_by_month.find({},
sort = [
('Year', 1),
('Month', 1)
])
return render_template('total_flights.html', total_flights=total_flights)
{% extends "layout.html" %}
{% block body %}
<div>
<p class="lead">Total Flights by Month</p>
<table class="table table-condensed table-striped" style="width: 200px;">
<thead>
<th>Month</th>
<th>Total Flights</th>
</thead>
<tbody>
{% for month in total_flights %}
<tr>
<td>{{month.Month}}</td>
<td>{{month.total_flights}}</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
{% endblock %}
43. Agile Data Science 2.0 43
Exploring your data through interaction
Reports
44. Data Syndrome: Agile Data Science 2.0
Creating Interactive Ontologies from Semi-Structured Data
Extracting and visualizing entities
44
45. Data Syndrome: Agile Data Science 2.0
Home Page
Extracting and decorating entities
45
46. Data Syndrome: Agile Data Science 2.0
Airline Entity
Extracting and decorating entities
46
47. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 1.0
Describing entities in aggregate
47
48. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 2.0
Describing entities in aggregate
48
49. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 3.0
Describing entities in aggregate
49
50. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 4.0
Describing entities in aggregate
50
51. Agile Data Science 2.0 51
Predicting the future for fun and profit
Predictions
52. Data Syndrome: Agile Data Science 2.0
Back End Design
Deep Storage and Spark vs Kafka and Spark Streaming
52
Batch: Historical Data → Train Model
Realtime: Realtime Data → Apply Model
53. Data Syndrome: Agile Data Science 2.0 53
jQuery in the web client submits a form to create the prediction request, and then polls another URL every few seconds until the prediction is ready. The request generates a Kafka event, which a Spark Streaming worker processes by applying the model we trained in batch. Having done so, it inserts a record for the prediction in MongoDB, where the Flask app sends it to the web client the next time it polls the server.
Front End Design
/flights/delays/predict/classify_realtime/
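The submit-then-poll protocol above can be sketched without any of the infrastructure. Below, an in-memory list stands in for Kafka, a dict for MongoDB, and a one-line rule for the trained model; every name here is hypothetical and only the shape of the protocol matches the slide:

```python
import uuid

# In-memory stand-ins: request_queue for Kafka, prediction_store for MongoDB.
request_queue = []
prediction_store = {}

def submit_prediction_request(features):
    """Client submits a form; the server enqueues a request and returns an id."""
    request_id = str(uuid.uuid4())
    request_queue.append({"id": request_id, "features": features})
    return request_id

def worker_step():
    """Stand-in for the Spark Streaming worker: apply the 'model' and store."""
    while request_queue:
        req = request_queue.pop(0)
        # Trivial placeholder model: late if departure delay is positive
        prediction = "late" if req["features"]["DepDelay"] > 0 else "on-time"
        prediction_store[req["id"]] = prediction

def poll(request_id):
    """Client polls until the prediction shows up; None means 'not ready'."""
    return prediction_store.get(request_id)

req_id = submit_prediction_request({"DepDelay": 14.0})
before = poll(req_id)   # not ready yet
worker_step()           # the worker catches up
after = poll(req_id)    # now ready
```

The decoupling is the point: the web tier never blocks on the model, so slow predictions degrade latency for one user rather than availability for all of them.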
54. Data Syndrome: Agile Data Science 2.0
String Vectorization
From properties of items to vector format
54
55. Data Syndrome: Agile Data Science 2.0 55
The equivalent scikit-learn model was 166 lines. Spark MLlib is very powerful!
190 Line Model
#!/usr/bin/env python
import sys, os, re
# Pass date and base path to main() from airflow
def main(base_path):
# Default to "."
try: base_path
except NameError: base_path = "."
if not base_path:
base_path = "."
APP_NAME = "train_spark_mllib_model.py"
# If there is no SparkSession, create the environment
try:
sc and spark
except NameError as e:
import findspark
findspark.init()
import pyspark
import pyspark.sql
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
#
# {
# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
# }
#
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField
from pyspark.sql.functions import udf
schema = StructType([
StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
StructField("Carrier", StringType(), True), # "Carrier":"WN"
StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
StructField("Dest", StringType(), True), # "Dest":"SAN"
StructField("Distance", DoubleType(), True), # "Distance":368.0
StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
StructField("Origin", StringType(), True), # "Origin":"TUS"
])
input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
base_path
)
features = spark.read.json(input_path, schema=schema)
features.first()
#
# Check for nulls in features before using Spark ML
#
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
print(list(cols_with_nulls))
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.show(6)
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into early, on-time, late, very late (0, 1, 2, 3)
#
from pyspark.ml.feature import Bucketizer
# Setup the Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
arrival_bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
# Save the bucketizer
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
# Apply the bucketizer
ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
#
# Extract features with tools in pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn category fields into indexes
for column in ["Carrier", "Origin", "Dest", "Route"]:
string_indexer = StringIndexer(
inputCol=column,
outputCol=column + "_index"
)
string_indexer_model = string_indexer.fit(ml_bucketized_features)
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
# Drop the original column
ml_bucketized_features = ml_bucketized_features.drop(column)
# Save the pipeline model
string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
base_path,
column
)
string_indexer_model.write().overwrite().save(string_indexer_output_path)
# Combine continuous, numeric fields with indexes of nominal ones
# ...into one feature vector
numeric_columns = [
"DepDelay", "Distance",
"DayOfMonth", "DayOfWeek",
"DayOfYear"]
index_columns = ["Carrier_index", "Origin_index",
"Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
inputCols=numeric_columns + index_columns,
outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Save the numeric vector assembler
vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
vector_assembler.write().overwrite().save(vector_assembler_path)
# Drop the index columns
for column in index_columns:
final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
maxMemoryInMB=1024
)
model = rfc.fit(final_vectorized_features)
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
# Evaluate model using test data
predictions = model.transform(final_vectorized_features)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
predictionCol="Prediction",
labelCol="ArrDelayBucket",
metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print("Accuracy = {}".format(accuracy))
# Check the distribution of predictions
predictions.groupBy("Prediction").count().show()
# Check a sample
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
if __name__ == "__main__":
main(sys.argv[1])
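The Bucketizer used above turns the continuous ArrDelay value into an ordinal category using the split points [-inf, -15.0, 0, 30.0, inf] defined in the script, where each bucket is the half-open interval [splits[i], splits[i+1]). A minimal pure-Python sketch of that bucketing logic (the `bucketize` helper name is ours, not part of the Spark API):

```python
import bisect

# The same split points the script passes to pyspark.ml.feature.Bucketizer
splits = [float("-inf"), -15.0, 0.0, 30.0, float("inf")]

def bucketize(value, splits):
    """Return the index of the half-open interval [splits[i], splits[i+1])
    that contains value, matching Bucketizer's left-inclusive semantics."""
    # bisect_right finds the insertion point; subtract 1 to get the bucket index
    return bisect.bisect_right(splits, value) - 1

# The four buckets: early, on-time, slightly late, very late
print(bucketize(-20.0, splits))  # 0: more than 15 minutes early
print(bucketize(-5.0, splits))   # 1: on time
print(bucketize(14.0, splits))   # 2: slightly late
print(bucketize(65.0, splits))   # 3: very late
```

Note that a value landing exactly on a split point (e.g. -15.0) falls into the bucket that starts there, which is why `bisect_right` rather than `bisect_left` models Spark's behavior.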
56. Data Syndrome: Agile Data Science 2.0
Experiment Setup
Necessary to improve the model

57. Data Syndrome: Agile Data Science 2.0
155 additional lines to set up an experiment
and add improvements
345 L.O.C.
#!/usr/bin/env python

import sys, os, re
import json
import datetime, iso8601
from tabulate import tabulate

# Pass the base path to main() from Airflow
def main(base_path):
    APP_NAME = "train_spark_mllib_model.py"

    # If there is no SparkSession, create the environment
    try:
        sc and spark
    except NameError as e:
        import findspark
        findspark.init()
        import pyspark
        import pyspark.sql

        sc = pyspark.SparkContext()
        spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
    #
    # Example input record:
    # {
    #   "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
    #   "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
    #   "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
    # }
    #
    from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
    from pyspark.sql.types import StructType, StructField
    from pyspark.sql.functions import udf

    schema = StructType([
        StructField("ArrDelay", DoubleType(), True),       # "ArrDelay":5.0
        StructField("CRSArrTime", TimestampType(), True),  # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
        StructField("CRSDepTime", TimestampType(), True),  # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
        StructField("Carrier", StringType(), True),        # "Carrier":"WN"
        StructField("DayOfMonth", IntegerType(), True),    # "DayOfMonth":31
        StructField("DayOfWeek", IntegerType(), True),     # "DayOfWeek":4
        StructField("DayOfYear", IntegerType(), True),     # "DayOfYear":365
        StructField("DepDelay", DoubleType(), True),       # "DepDelay":14.0
        StructField("Dest", StringType(), True),           # "Dest":"SAN"
        StructField("Distance", DoubleType(), True),       # "Distance":368.0
        StructField("FlightDate", DateType(), True),       # "FlightDate":"2015-12-30T16:00:00.000-08:00"
        StructField("FlightNum", StringType(), True),      # "FlightNum":"6109"
        StructField("Origin", StringType(), True),         # "Origin":"TUS"
    ])

    input_path = "{}/data/simple_flight_delay_features.json".format(
        base_path
    )
    features = spark.read.json(input_path, schema=schema)
    features.first()
    #
    # Add a Route variable to replace FlightNum
    #
    from pyspark.sql.functions import lit, concat
    features_with_route = features.withColumn(
        'Route',
        concat(
            features.Origin,
            lit('-'),
            features.Dest
        )
    )
    features_with_route.show(6)

    #
    # Add the hour of day of scheduled arrival/departure
    #
    from pyspark.sql.functions import hour
    features_with_hour = features_with_route.withColumn(
        "CRSDepHourOfDay",
        hour(features.CRSDepTime)
    )
    features_with_hour = features_with_hour.withColumn(
        "CRSArrHourOfDay",
        hour(features.CRSArrTime)
    )
    features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()
    #
    # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into
    # early, on-time, slightly late and very late (0, 1, 2, 3)
    #
    from pyspark.ml.feature import Bucketizer

    # Set up the Bucketizer
    splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
    arrival_bucketizer = Bucketizer(
        splits=splits,
        inputCol="ArrDelay",
        outputCol="ArrDelayBucket"
    )

    # Save the model
    arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
    arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)

    # Apply the model
    ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
    ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
    #
    # Extract features using tools from pyspark.ml.feature
    #
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Turn category fields into indexes
    for column in ["Carrier", "Origin", "Dest", "Route"]:
        string_indexer = StringIndexer(
            inputCol=column,
            outputCol=column + "_index"
        )
        string_indexer_model = string_indexer.fit(ml_bucketized_features)
        ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

        # Save the pipeline model
        string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
            base_path,
            column
        )
        string_indexer_model.write().overwrite().save(string_indexer_output_path)

    # Combine continuous, numeric fields with indexes of nominal ones
    # ...into one feature vector
    numeric_columns = [
        "DepDelay", "Distance",
        "DayOfMonth", "DayOfWeek",
        "DayOfYear", "CRSDepHourOfDay",
        "CRSArrHourOfDay"]
    index_columns = ["Carrier_index", "Origin_index",
                     "Dest_index", "Route_index"]
    vector_assembler = VectorAssembler(
        inputCols=numeric_columns + index_columns,
        outputCol="Features_vec"
    )
    final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

    # Save the numeric vector assembler
    vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
    vector_assembler.write().overwrite().save(vector_assembler_path)

    # Drop the index columns
    for column in index_columns:
        final_vectorized_features = final_vectorized_features.drop(column)

    # Inspect the finalized features
    final_vectorized_features.show()
    #
    # Cross validate, train and evaluate the classifier: loop over
    # split_count test/train splits, collecting four metrics each time
    #
    from collections import defaultdict
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    scores = defaultdict(list)
    feature_importances = defaultdict(list)
    metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
    split_count = 3

    for i in range(1, split_count + 1):
        print("\nRun {} out of {} of test/train splits in cross validation...".format(
            i,
            split_count,
        ))

        # Test/train split
        training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])

        # Instantiate and fit a random forest classifier on the training data
        rfc = RandomForestClassifier(
            featuresCol="Features_vec",
            labelCol="ArrDelayBucket",
            predictionCol="Prediction",
            maxBins=4657,
        )
        model = rfc.fit(training_data)

        # Save the new model over the old one
        model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
            base_path
        )
        model.write().overwrite().save(model_output_path)

        # Evaluate the model using the test data
        predictions = model.transform(test_data)

        # Evaluate this split's results for each metric
        for metric_name in metric_names:
            evaluator = MulticlassClassificationEvaluator(
                labelCol="ArrDelayBucket",
                predictionCol="Prediction",
                metricName=metric_name
            )
            score = evaluator.evaluate(predictions)
            scores[metric_name].append(score)
            print("{} = {}".format(metric_name, score))

        #
        # Collect feature importances
        #
        feature_names = vector_assembler.getInputCols()
        feature_importance_list = model.featureImportances
        for feature_name, feature_importance in zip(feature_names, feature_importance_list):
            feature_importances[feature_name].append(feature_importance)
    #
    # Evaluate the average and standard deviation of each metric and print a table
    #
    import numpy as np
    score_averages = defaultdict(float)

    # Compute the table data
    average_stds = []  # ha
    for metric_name in metric_names:
        metric_scores = scores[metric_name]
        average_accuracy = sum(metric_scores) / len(metric_scores)
        score_averages[metric_name] = average_accuracy
        std_accuracy = np.std(metric_scores)
        average_stds.append((metric_name, average_accuracy, std_accuracy))

    # Print the table
    print("\nExperiment Log")
    print("--------------")
    print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))
    #
    # Persist the score to a score log that exists between runs
    #
    import pickle

    # Load the score log or initialize an empty one
    try:
        score_log_filename = "{}/models/score_log.pickle".format(base_path)
        score_log = pickle.load(open(score_log_filename, "rb"))
        if not isinstance(score_log, list):
            score_log = []
    except IOError:
        score_log = []

    # Compute the existing score log entry
    score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}

    # Compute and display the change in score for each metric
    try:
        last_log = score_log[-1]
    except (IndexError, TypeError, AttributeError):
        last_log = score_log_entry

    experiment_report = []
    for metric_name in metric_names:
        run_delta = score_log_entry[metric_name] - last_log[metric_name]
        experiment_report.append((metric_name, run_delta))

    print("\nExperiment Report")
    print("-----------------")
    print(tabulate(experiment_report, headers=["Metric", "Score"]))

    # Append the average scores to the log
    score_log.append(score_log_entry)

    # Persist the log for the next run
    pickle.dump(score_log, open(score_log_filename, "wb"))
    #
    # Analyze and report feature importance changes
    #

    # Compute averages for each feature
    feature_importance_entry = defaultdict(float)
    for feature_name, value_list in feature_importances.items():
        average_importance = sum(value_list) / len(value_list)
        feature_importance_entry[feature_name] = average_importance

    # Sort the feature importances in descending order and print
    import operator
    sorted_feature_importances = sorted(
        feature_importance_entry.items(),
        key=operator.itemgetter(1),
        reverse=True
    )

    print("\nFeature Importances")
    print("-------------------")
    print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))

    #
    # Compare this run's feature importances with the previous run's
    #

    # Load the feature importance log or initialize an empty one
    try:
        feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
        feature_log = pickle.load(open(feature_log_filename, "rb"))
        if not isinstance(feature_log, list):
            feature_log = []
    except IOError:
        feature_log = []

    # Compute and display the change in importance for each feature
    try:
        last_feature_log = feature_log[-1]
    except (IndexError, TypeError, AttributeError):
        # First run: compare against this run's own values, so deltas are zero
        last_feature_log = defaultdict(float)
        for feature_name, importance in feature_importance_entry.items():
            last_feature_log[feature_name] = importance

    # Compute the deltas
    feature_deltas = {}
    for feature_name in feature_importances.keys():
        run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
        feature_deltas[feature_name] = run_delta

    # Sort the feature deltas, biggest change first
    sorted_feature_deltas = sorted(
        feature_deltas.items(),
        key=operator.itemgetter(1),
        reverse=True
    )

    # Display the sorted feature deltas
    print("\nFeature Importance Delta Report")
    print("-------------------------------")
    print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))

    # Append the averages to the log
    feature_log.append(feature_importance_entry)

    # Persist the log for the next run
    pickle.dump(feature_log, open(feature_log_filename, "wb"))
if __name__ == "__main__":
    main(sys.argv[1])
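The score log in the script above implements a simple experiment-tracking pattern: load previous runs from a pickle file, compare this run's metric averages against the last entry, then append and persist. A self-contained sketch of that pattern, independent of Spark (the `log_scores` helper name, file path and metric values are illustrative, not from the script):

```python
import os
import pickle
import tempfile

def log_scores(score_log_filename, score_log_entry):
    """Append this run's metric averages to a pickled log and return the
    per-metric deltas against the previous run (zero on the first run)."""
    # Load the existing log, or start a new one if the file is missing
    try:
        with open(score_log_filename, "rb") as f:
            score_log = pickle.load(f)
        if not isinstance(score_log, list):
            score_log = []
    except IOError:
        score_log = []

    # Compare against the previous entry; the first run compares to itself
    last_log = score_log[-1] if score_log else score_log_entry
    deltas = {name: score_log_entry[name] - last_log[name]
              for name in score_log_entry}

    # Append this run and persist the log for the next run
    score_log.append(score_log_entry)
    with open(score_log_filename, "wb") as f:
        pickle.dump(score_log, f)
    return deltas

# Illustrative usage with made-up metric values
log_path = os.path.join(tempfile.mkdtemp(), "score_log.pickle")
print(log_scores(log_path, {"accuracy": 0.59, "f1": 0.57}))  # first run: all zeros
print(log_scores(log_path, {"accuracy": 0.62, "f1": 0.60}))  # deltas vs. first run
```

Because the log survives between runs, each experiment (a new feature, a hyperparameter change) immediately shows whether it moved the metrics up or down, which is the point of the Experiment Report in the script.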