Learn How to Run Python on Redshift

Chartio
May 12, 2016


Editor's Notes

  1. For those unfamiliar with Amazon Redshift: it is a fast, fully managed, petabyte-scale data warehouse that costs less than $1,000 per terabyte per year. It is fast, cost effective, and easy to use (launch a cluster in a few minutes, scale with the push of a button).
  2. Redshift is not only cheaper but also easier to use: provisioning takes about 15 minutes.
  3. Header Only
  4. Two-Section
  5. Three Section
  6. Three Section
  7. Section Header
  8. While SQL is a phenomenal tool for data extraction, it's either painful or impossible to work with for analysis. General-purpose programming languages like Python, on the other hand, are better suited to analysis and visualization but more difficult to use for pure extraction. Into this gap, services like Chartio have emerged, providing extended visualization and analysis options usually accomplished with those more traditional programming tools.
  9. UDFs begin to bridge this gap by providing limited Python functionality within the scope of your standard SQL toolbox.
  10. Because our supply of labor (lovingly referred to as Bellhops) is free to set its own schedule, understanding the health of a market is extremely important. Too many Bellhops chasing too little work yields high churn and inexperienced laborers. On the other hand, having only a handful of Bellhops might be sufficient to service demand in small or growing markets. However, this dynamic is unstable: what happens if a Bellhop decides to take a month off? Or how will the market respond to sudden spikes in demand, as happens during the summer? One of the measures we use to determine when a market has entered an unstable dynamic like this is the Herfindahl Index.
  11. There's no need to linger on this, but the Herfindahl Index is the sum of the squared market shares of the actors in a market. For example, if we were looking at the soda industry, the actors in that market would be Coca-Cola, Pepsi, Fanta, etc. More important is how it's used. (next slide)
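The definition above is easy to sketch in code. A minimal illustration (the function name is mine, not from the deck):

```python
def herfindahl_index(shares):
    """Sum of squared market shares.

    `shares` are fractions summing to ~1.0; the index runs from
    near 0 (many tiny actors) up to 1.0 (a single monopolist).
    """
    return sum(s * s for s in shares)

# A market split 50/30/20 across three actors scores
# 0.25 + 0.09 + 0.04 = 0.38.
concentration = herfindahl_index([0.5, 0.3, 0.2])
```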
  12. It has found common usage in economics to determine whether an industry has become monopolistic and to what degree that might be the case. We at Bellhops use a similar idea to measure the concentration of work among our labor supply in each market. A combination of this metric and UDFs allows us to provide real-time feedback to the organization that would otherwise require separate extraction, processing, and analysis steps. Take our index as an example: we are interested in knowing if there have been any sudden changes in indexed concentration for any of our markets, which can be used as a call to action for our market health team. So how do we do that?
  13. First, we calculate the current state of each market every day, week, month, etc. as part of our ETL process.
  14. These values are then fed from our warehouse into psql, Chartio, or any other tool of your choice. In our case, our end users (the Market Health Team) interact with data primarily through Chartio, so that's where it will sit.
  15. We next feed these values into a Python UDF, which in this case is an implementation of Student's t-test. A t-test effectively allows you to determine whether a value differs significantly from its historical distribution and the degree to which it differs. It's an especially important distribution when the number of samples is small. In this case we will be determining whether a market's concentration differs significantly from the historical (say, the past six months of) observations.
  16. These significance warnings are then surfaced directly to the relevant users through pre-made Chartio dashboards so they can take action when necessary.
  17. Finally, here is our actual UDF: an implementation of a two-sided t-test with a couple of notable features. We were able to make use of prebuilt Python functionality like SciPy. So long as our UDF exclusively uses data available immediately within its scope (unfortunately meaning no disk or network access), we have all the power of Python at our fingertips. That means things like complicated conditional logic can be trivially implemented, bypassing otherwise clumsy SQL.
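The deck's actual UDF is not reproduced in these notes, so here is a rough sketch of what a Redshift UDF along these lines could look like. The function name, signature, and guard clauses are my own assumptions, not the original code; SciPy is available inside Redshift's Python UDF environment, per the note above:

```sql
-- Hypothetical two-sided t-test UDF; names are illustrative,
-- not the deck's actual code.
CREATE OR REPLACE FUNCTION t_test_sig (
    cur_value float, hist_mean float, hist_sd float, hist_n int)
RETURNS float
STABLE
AS $$
    from scipy import stats
    if hist_sd is None or hist_sd == 0 or hist_n < 2:
        return None
    t = (cur_value - hist_mean) / hist_sd
    # Two-sided p-value against Student's t with n-1 degrees of freedom
    return 2 * stats.t.sf(abs(t), hist_n - 1)
$$ LANGUAGE plpythonu;
```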
  18. Let's just take a toy model of two tables in our data warehouse. The first is a fact table containing a market_id and a foreign key to the market_month_dimension table. The market_month_dimension table contains a variety of statistics calculated monthly for each market, one of which is the herfindahl_index.
  19. With our UDF and schema in hand, we can now execute a query! In this case we are using our t-test UDF to determine in which months Atlanta's Herfindahl Index changed dramatically compared to the past six months. As you can see, actually using the UDF is extremely simple; it behaves as if it were any other function. The majority of the hard work lies in constructing the temporary table containing the six-month moving averages and standard deviations.
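The query itself is not included in these notes either; under the toy schema above it could look roughly like the following. The fact table name, join key, market identifier, and `t_test_sig` UDF are all hypothetical stand-ins:

```sql
-- Flag months where Atlanta's index shifted significantly versus
-- the trailing six months (all names are illustrative).
WITH hist AS (
    SELECT f.market_id,
           d.month,
           d.herfindahl_index,
           AVG(d.herfindahl_index) OVER (
               PARTITION BY f.market_id ORDER BY d.month
               ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS hist_mean,
           STDDEV(d.herfindahl_index) OVER (
               PARTITION BY f.market_id ORDER BY d.month
               ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS hist_sd
    FROM market_fact f
    JOIN market_month_dimension d
      ON d.market_month_id = f.market_month_id
)
SELECT month, herfindahl_index
FROM hist
WHERE market_id = 'atlanta'
  AND t_test_sig(herfindahl_index, hist_mean, hist_sd, 6) < 0.05;
```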
  20. By their scalar nature, UDFs are in some sense reflective rather than prescriptive. We found that reflective nature to be most useful in support of the analytics being performed by our BI team. They are additionally useful when cumbersome SQL expressions can be simplified by an equivalent Python library or representation. Things like: (slide)
  21. Complicated conditional logic
  22. Text processing, especially when the equivalent regular expression is complicated or contains numerous edge cases (URLs, emails, etc.)
  23. and doing basic statistical analysis.
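As one illustration of the text-processing case (my own example, not from the deck): pulling the bare domain out of a URL is a few readable lines in Python but an ugly regular expression in SQL. Note that Redshift's Python UDFs have historically run Python 2.7, where the import would be `from urlparse import urlparse`; the sketch below uses Python 3 syntax so it runs standalone:

```python
from urllib.parse import urlparse

def url_domain(url):
    """Return the bare domain of a URL, or None if there isn't one.

    In Redshift this body would sit inside a
    CREATE FUNCTION ... LANGUAGE plpythonu wrapper.
    """
    if not url:
        return None
    # netloc is e.g. "www.Example.com:8080"; drop the port and "www."
    host = urlparse(url).netloc.lower().split(":")[0]
    if host.startswith("www."):
        host = host[4:]
    return host or None

print(url_domain("https://www.Example.com:8080/a/b?q=1"))  # example.com
```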