
Learn How to Run Python on Redshift

How Bellhops Leverages Amazon Redshift UDFs
for Massively Parallel Data Science
Ian Eaves, Bellhops
May 12th, 2016

Today’s Speakers
Chartio
AJ Welch
Chartio.com
Bellhops
Ian Eaves
GetBellhops.com
AWS
Brandon Chavis
aws.amazon.com

Ian Eaves, Data Scientist of Bellhops, shares how he uses Amazon Redshift's user-defined functions (UDFs) and Chartio to save multiple hours each week by running Python analysis directly in Amazon Redshift.
Learn how Bellhops combines Python with the power of Redshift to quickly analyze large datasets in real time, opening up new possibilities for their data teams.

  1. How Bellhops Leverages Amazon Redshift UDFs for Massively Parallel Data Science. Ian Eaves, Bellhops. May 12th, 2016
  2. Today's Speakers: Chartio, AJ Welch, Chartio.com; Bellhops, Ian Eaves, GetBellhops.com; AWS, Brandon Chavis, aws.amazon.com
  3. Housekeeping • The recording will be sent to all webinar participants after the event. • Questions? Type them in the chat box and we will answer at the end. • Posting to social? Use #AWSandChartio
  4. What is Amazon Redshift? Relational data warehouse. Massively parallel; petabyte scale. Fully managed. HDD and SSD platforms. $1,000/TB/Year; starts at $0.25/hour.
  5. Amazon Redshift is easy to use • Provision in minutes • Monitor query performance • Point-and-click resize • Built-in security • Automatic backups
  6. Amazon Redshift System Architecture: 10 GigE (HPC), Ingestion, Backup, Restore, JDBC/ODBC
  7. The Amazon Redshift view of data warehousing (Enterprise, Big Data, SaaS) • 10x cheaper • Easy to provision • Higher DBA productivity • 10x faster • No programming • Easily leverage BI tools, Hadoop, Machine Learning, Streaming • Analysis in-line with process flows • Pay as you go, grow as you need • Managed availability & DR
  8. The legacy view of data warehousing ... Global 2,000 companies. Sell to central IT. Multi-year commitments. Multi-year deployments. Multi-million dollar deals.
  9. … Leads to dark data. This is a narrow view: small companies also have big data (mobile, social, gaming, adtech, IoT). Long cycles, high costs, and administrative complexity all stifle innovation. (Chart: Enterprise Data vs. Data in Warehouse)
  10. New SQL Functions. We add SQL functions regularly to expand Amazon Redshift's query capabilities. Added 25+ window and aggregate functions since launch, including: • LISTAGG • [APPROXIMATE] COUNT • DROP IF EXISTS, CREATE IF NOT EXISTS • REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE • PERCENTILE_CONT, _DISC, MEDIAN • PERCENT_RANK, RATIO_TO_REPORT. We'll continue iterating but also want to enable you to write your own.
  11. Scalar User Defined Functions. You can write UDFs using Python 2.7 • Syntax is largely identical to PostgreSQL UDF syntax • System and network calls within UDFs are prohibited. Comes with Pandas, NumPy, and SciPy pre-installed • You'll also be able to import your own libraries for even more flexibility
  12. Scalar UDF Example:

CREATE FUNCTION f_hostname (VARCHAR url)
RETURNS varchar IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
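Outside Redshift, the body of this UDF can be exercised as ordinary Python. A minimal local sketch follows; note that Redshift UDFs run Python 2.7, where the module is `urlparse`, while Python 3 moved it to `urllib.parse`:

```python
# Local sketch of the f_hostname UDF body (Python 3 equivalent).
# In the Redshift UDF above this is `import urlparse` under Python 2.7.
from urllib.parse import urlparse

def f_hostname(url):
    # Mirrors the UDF: parse the URL and return its hostname (None if absent),
    # which would surface as NULL in SQL.
    return urlparse(url).hostname

print(f_hostname("https://www.example.com/path?q=1"))  # www.example.com
```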
  13. Analytics for Everyone: The best platform for everyone to explore and visualize data
  14. The Smartest Companies Use Chartio
  15. Legacy BI • Expensive to set up • Expensive to maintain • Requires technical skills • Creates a bottleneck • Limits your ability to make decisions
  16. Modern BI • Faster time to value • Easier to maintain • Modes for both technical and non-technical users • Alleviates bottlenecks • Enhances your ability to make decisions
  17. Chartio's Modern Architecture: Schema/Business Rules, Interactive Mode, SQL Mode, Data Stores, TV Screens, Scheduled Emails, Data Exploration, Dashboards, Embedded, Data Pipeline/Data Blending, Data Caching, Security
  18. Chartio's Modern Architecture
  19. Chartio's Modern Architecture
  20. The Chartio Schema Editor • Team-specific schemas • Rename tables/columns • Hide tables/columns • Define custom tables/columns • Define data types and foreign keys
  21. Schema Editor Live Demo
  22. UDFs (A Brave New World): Using UDFs in the Real World. Ian Eaves, Data Scientist
  23. The Land Between SQL and Analysis
  24. The Land Between SQL and Scripts: UDF
  25. A Little About Bellhops • On-demand moving and labor company • Self-scheduling capacity (a la Uber) • Located in 83 markets
  26. Herfindahl Index (H): H = s_1^2 + s_2^2 + … + s_N^2, where N = number of market actors and s_i = market share of the ith actor
  27. Herfindahl Index
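In code, the index is just the sum of squared market shares. A minimal sketch (function name hypothetical): an even four-way split gives H = 0.25, while a market dominated by one actor pushes H toward 1.

```python
def herfindahl_index(shares):
    """Sum of squared market shares; shares are assumed to sum to 1.

    H ranges from 1/N (perfectly even split among N actors) to 1 (monopoly).
    """
    return sum(s * s for s in shares)

# Four actors with an even split: H = 4 * 0.25**2 = 0.25
print(herfindahl_index([0.25, 0.25, 0.25, 0.25]))  # 0.25
# One dominant actor: concentration pushes H toward 1
print(round(herfindahl_index([0.7, 0.1, 0.1, 0.1]), 4))  # 0.52
```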
  28. Market Health Feedback Process: Monthly, Data Warehouse
  29. Market Health Feedback Process: Monthly, Data Warehouse
  30. Market Health Feedback Process: Monthly, UDFs, Data Warehouse
  31. Market Health Feedback Process: Monthly, Users, UDFs, Data Warehouse
  32. T-test UDF:

create function f_t_test (val float, mean float, stddev float, n_samps float, alpha float)
returns varchar stable as $$
from scipy.stats import t
df = n_samps - 1
tval = (mean - val) / stddev
p = t.sf(abs(tval), df) * 2  # two sided
if p <= alpha:
    return 'Better' if tval > 0 else 'Worse'
else:
    return 'No Change'
$$ LANGUAGE plpythonu;
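The classification logic in this UDF can be sanity-checked locally without SciPy by comparing the t statistic against a two-sided critical value instead of computing a p-value (an equivalent decision rule, since p <= alpha exactly when |t| >= t_crit). The helper below is a hypothetical local sketch, not part of the deck; for df = 5 at a two-sided alpha of 0.2, t_crit is roughly 1.476.

```python
def t_test_flag(val, mean, stddev, t_crit):
    # Same classification as the f_t_test UDF, but the caller supplies the
    # two-sided critical value rather than deriving a p-value via SciPy.
    tval = (mean - val) / stddev
    if abs(tval) >= t_crit:
        # Lower concentration than the trailing mean reads as 'Better'.
        return 'Better' if tval > 0 else 'Worse'
    return 'No Change'

# df = 6 - 1 = 5 trailing observations; two-sided alpha = 0.2 -> t_crit ~ 1.476
print(t_test_flag(0.10, 0.30, 0.05, 1.476))  # Better
print(t_test_flag(0.31, 0.30, 0.05, 1.476))  # No Change
```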
  33. Example Table Schema:

market_month_dimension: id, date, herfindahl_index, ~
market_fact: id, month_key, market_name, ~
  34. Our Query:

WITH market_stats AS (
  SELECT market_name, date, mmd.herfindahl_index,
    avg(herfindahl_index) OVER (PARTITION BY market_name ORDER BY date
      ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) as avg,
    stddev_samp(herfindahl_index) OVER (PARTITION BY market_name ORDER BY date
      ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) as stddev
  FROM market_fact
  LEFT JOIN (
    SELECT herfindahl_index, month_key as join_key, date
    FROM market_month_dimension
  ) AS mmd ON join_key = market_fact.month_key
  GROUP BY market_name, date, month_key, mmd.herfindahl_index
  ORDER BY date
)
SELECT market_name, date, herfindahl_index, avg, stddev,
  f_t_test(herfindahl_index, avg, stddev, 6, .2)
FROM market_stats
WHERE market_name = 'Atlanta'
ORDER BY market_name, date;
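The window functions in this query take the trailing six rows per market, excluding the current row. As a sanity check, the same rolling statistics can be sketched in plain Python (helper name hypothetical):

```python
# Mirrors AVG(...) / STDDEV_SAMP(...) OVER (... ROWS BETWEEN 6 PRECEDING
# AND 1 PRECEDING): statistics over the previous `window` rows, excluding
# the current row.
from statistics import mean, stdev

def trailing_stats(values, window=6):
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        if len(prior) < 2:
            # Not enough prior rows for a sample stddev; SQL would yield NULL.
            out.append((None, None))
        else:
            out.append((mean(prior), stdev(prior)))
    return out

idx = [0.21, 0.20, 0.22, 0.19, 0.25]
# Row 4 gets the average and sample stddev of the first four values.
print(trailing_stats(idx)[4])
```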
  35. Sample Result
  36. General Thoughts: UDF Use Cases
  37. General Thoughts: UDF Use Cases • Complicated Conditional Logic
  38. General Thoughts: UDF Use Cases • Complicated Conditional Logic • Text Processing
  39. General Thoughts: UDF Use Cases • Complicated Conditional Logic • Text Processing • Basic Statistical Analysis
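As a hypothetical illustration of the text-processing case, a short Python UDF body can stand in for a hairy REGEXP_SUBSTR call; for example, extracting the domain from an email address:

```python
# Hypothetical UDF body: inside Redshift this would be wrapped in
# CREATE FUNCTION ... LANGUAGE plpythonu. Not from the deck.
import re

def f_email_domain(email):
    # Capture everything after the final '@'; returning None surfaces
    # as NULL in SQL when the input doesn't look like an email address.
    m = re.search(r'@([\w.-]+)$', email or '')
    return m.group(1).lower() if m else None

print(f_email_domain('Ian.Eaves@GetBellhops.com'))  # getbellhops.com
```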
  40. Thank You
  41. Questions? Chartio: AJ Welch, aj@chartio.com, Chartio.com. Bellhops: Ian Eaves, ian.eaves@getbellhops.com, GetBellhops.com. AWS: Brandon Chavis, chavisb@amazon.com, aws.amazon.com

Editor's Notes

  • For those unfamiliar with Amazon Redshift, it is a fast, fully managed, petabyte-scale data warehouse for less than $1000 per terabyte per year.

    fast, cost effective, easy to use (launch cluster in a few minutes, scale with the push of a button)
  • Redshift is not only cheaper but also easy to use. Provisioning takes 15 minutes.
  • Header Only
  • Two-Section
  • Three Section
  • Three Section
  • Section Header
  • While SQL is a phenomenal tool for data extraction, it's either painful or impossible to work with for analysis.

    General-purpose programming languages like Python, on the other hand, are better suited to analysis and visualization but more difficult to use for pure extraction

    Into this gap services like Chartio have emerged providing extended visualization and analysis options usually accomplished with those more traditional programming tools.
  • UDFs begin to bridge this gap by providing limited Python functionality within the scope of your standard SQL toolbox.
  • Because our supply of labor (lovingly referred to as Bellhops) is free to set their own schedules, understanding the health of a market is extremely important.

    Too many Bellhops chasing too little work yields high churn and inexperienced laborers.

    On the other hand, only a handful of Bellhops might be sufficient to service demand in small or growing markets. However, this dynamic is unstable: what happens if a Bellhop decides to take a month off? Or how will it respond to sudden spikes in demand, as happens during the summer?

    One of the measures we use to determine when a market has entered an unstable dynamic like this is the Herfindahl Index
  • There’s no need to linger on this but the Herfindahl Index is the sum of the squared market shares of actors in a market.

    So for example, if we were looking at the Soda industry the actors in that market would be Coca Cola, Pepsi, Fanta, etc…

    More important is how it’s used. (next slide)
  • The index has found common usage in economics to determine whether an industry has become monopolistic and to what degree that might be the case.

    We at Bellhops use a similar idea to measure the concentration of work amongst our labor supply in each market.

    A combination of this metric and UDFs allows us to provide real-time feedback to the organization which would otherwise require separate extraction, processing, and analysis steps.

    Take our index as an example. We are interested in knowing if there have been any sudden changes in indexed concentration for any of our markets.

    This can be used as a call to action for our market health team
    So how do we do that?

  • First we calculate the current state of each market every day, week, month, etc., as part of our ETL process

  • These values are then fed from our warehouse into psql, Chartio, or any other tool of your choice. In our case our end users (the Market Health Team) interact with data primarily through Chartio, so that's where it will sit.


  • We next feed these values into a Python UDF

    Which in this case is an implementation of Student's t-test.

    A t-test effectively allows you to determine whether a value differs significantly from its historical distribution and the degree to which it differs. It's an especially important test when the number of samples being used is small.

    In this case we will be determining whether a market's concentration differs significantly from the historical (say, the past six months of) observations.
  • Finally these significance warnings are surfaced directly to relevant users through pre-made Chartio dashboards for them to take action when necessary.
  • Finally here is our actual UDF. This is an implementation of a two-sided t-test with a couple of notable features.

    We were able to make use of prebuilt Python functionality like SciPy.
    So long as our UDF exclusively uses data available immediately within its scope (unfortunately meaning no disk or network access), we have all the power of Python at our fingertips. That means things like complicated conditional logic can be trivially implemented, bypassing otherwise clumsy SQL.
  • Let’s just take a toy model of two tables in our data warehouse. The first is a fact table containing a market_id and a foreign key to the market_month_dimension table.

    The market_month_dimension table contains a variety of statistics calculated monthly for each market; one of which is the herfindahl_index.
  • With our UDF and schema in hand we can now execute a query!

    In this case we are using our T-test UDF to determine in which months Atlanta’s herfindahl index changed dramatically as compared to the past six months.

    As you can see, actually using the UDF is extremely simple and it behaves like any other function. The majority of the hard work lies in constructing our temporary table containing the six-month moving average and standard deviations.
  • By their scalar nature, UDFs are in some sense reflective rather than prescriptive. We found that reflective nature to be most useful in support of the analytics being performed by our BI team.

    They are additionally useful when cumbersome SQL expressions might be simplified by an equivalent Python library or representation. Things like (slide)
  • Complicated conditional logic
  • Text processing, especially when the equivalent regular expression is complicated or contains numerous edge cases (URLs, emails, etc…)
  • and doing basic statistical analysis.
