
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost

Nov. 24, 2020

  1. How The Weather Company® Uses Spark to Serve Weather Data Faster and Cheaper
     By Leveraging IBM Cloud® SQL Query and Cloud Object Storage
     Erik Goepfert and Paula Ta-Shma
     IBM TWC and IBM Research
  2. The Weather Company started with a simple mission to
  3. And has evolved into decisions and solutions:
     ▪ Map the atmosphere every 15 minutes
     ▪ Process over 400 terabytes of data daily
     ▪ Deliver more than 50 billion requests for weather information every day and produce 25 billion forecasts daily
     Source: Qliksense internal report, April 2017; according to internal forecasting system and # of locations in the world by lat/lon (2 decimal places); 400 terabytes according to internal SUN platform numbers
  4. History on Demand Conditions (HoD)
     ▪ Provides access to a worldwide, hourly, high-resolution, gridded dataset of past weather conditions via a web API
       ▪ Global 4 km grid, 0.044-degree resolution
       ▪ 34 potential weather properties
       ▪ 34 million records added every hour
     ▪ Geospatial and temporal search
       ▪ Point, bounding box, and polygon search over a time range
     Usage
     ▪ Averages 600,000 requests per day
     ▪ Used by clients primarily for machine learning and data analytics
     ▪ Supports research in domains such as climate science, energy & utilities, agriculture, transportation, insurance, and retail
     Source: ForecastWatch, Three Region Accuracy Overview, 2010-2017, https://www.forecastwatch.com/static/Three_Region_Accuracy_Overview_2010-2017.pdf
  5. Problems with our previous solution
     ▪ Expensive
       ▪ Our synchronous data access solution is expensive
     ▪ Limited storage capacity
       ▪ Hard storage limits per cluster with our previous cloud provider and storage solution
     ▪ We have reduced the data we provide
       ▪ To lower cost and stay below the storage limit, we've reduced our data to land only, and 20 of the available 34 weather properties
     ▪ Clients are limited to small requests
       ▪ To allow for a synchronous interaction, clients are required to limit the scope of their requests to 2,400 records
     ▪ Slow at retrieving large amounts of data
       ▪ Because of the small query sizes, it is time-consuming to retrieve large amounts of data
  6. Our new asynchronous solution
     ▪ More cost-effective
       ▪ Our use of IBM Cloud SQL Query and Cloud Object Storage has resulted in an order of magnitude reduction in cost
     ▪ Unlimited storage
       ▪ With Cloud Object Storage we effectively have unlimited storage capacity
     ▪ Global weather data coverage with all 34 weather properties
       ▪ With the reduced cost and unlimited storage we no longer have to limit the data we provide
     ▪ Support for large requests
       ▪ With an asynchronous interaction, clients can now submit a single request for everything they're interested in
     ▪ Large amounts of data retrieved quickly with a single query
       ▪ Because we can rely on IBM Cloud SQL Query using Spark behind the scenes, large queries complete relatively quickly
  7. Solution Overview
     Serverless approach
     ▪ Pay per use -> low cost
     IBM Cloud SQL Query
     ▪ Serverless SQL powered by Spark
     IBM Cloud Object Storage (COS)
     ▪ S3-compatible API
     Apply best practices
     ▪ Parquet
     ▪ Geospatial data layout
  8. IBM Cloud SQL Query
     ▪ Serverless SQL service built on Apache Spark
       ▪ Supports all Spark native data formats, e.g. Parquet, ORC, CSV, Avro, JSON
       ▪ Intuitive UI, no set up/installation required
       ▪ Integrated with Watson Studio notebooks and Cloud Functions, and supports REST APIs
     ▪ Query and ETL data on COS directly (see the sketch below)
       ▪ Also supports Db2
     ▪ High performance
       ▪ Built-in Catalog – multi-tenant Hive Metastore
       ▪ Data skipping indexes
     ▪ Low cost
       ▪ $5/TB scanned
       ▪ Skip over irrelevant data using the Catalog and data skipping
       ▪ Zero standing cost
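To make the "query and ETL data on COS directly" point concrete, here is a minimal sketch in the same SQL dialect used elsewhere in this deck; the bucket paths and column names are hypothetical placeholders, not TWC's actual data.

     -- Hypothetical ETL job: read raw CSV on COS, filter it, and write the
     -- result back to COS as Parquet (paths and columns are placeholders)
     SELECT station_id, date_time, temp
     FROM cos://us-geo/raw-weather/csv/ STORED AS CSV
     WHERE temp IS NOT NULL
     INTO cos://us-south/curated-weather/ STORED AS PARQUET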
  9. SQL Query Catalog
     ▪ Multi-tenant Hive Metastore (HMS)
     ▪ Critical to achieve high performance for Big Data
       1. Spark SQL leverages HMS for partition pruning – avoid reading irrelevant partitions
          ▪ More flexible than the Hive-style partitioning naming convention
       2. Significantly reduces time spent on object listing
          ▪ HMS enables listing partitions in parallel – sequential listing can be very slow
       3. Stores schema and statistics for cost-based optimization
     ▪ Useful to maintain consistency
       ▪ Can consistently replace an entire partition
       ▪ Needed for changing the data layout
       ▪ Future: use Delta/Iceberg/Hudi format for this
     Example partitioned layout on COS (each dt=... partition is tracked as partition metadata in HMS):
     Weather/dt=2020-08-17/part-00085.parquet
     Weather/dt=2020-08-17/part-00086.parquet
     Weather/dt=2020-08-17/part-00087.parquet
     Weather/dt=2020-08-17/part-00088.parquet
     Weather/dt=2020-08-18/part-00001.parquet
     Weather/dt=2020-08-18/part-00002.parquet
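As a rough sketch of how a table over the layout above could be registered in the Catalog: the demo later in this deck only reports "create table" and "recover partitions" timings, so the exact DDL, table name, and columns below are assumptions based on standard Spark SQL syntax rather than the production definitions.

     -- Assumed DDL: register a catalog table over the partitioned COS layout above
     CREATE TABLE weather (lat FLOAT, long FLOAT, temp FLOAT, date_time TIMESTAMP, dt STRING)
     USING PARQUET
     PARTITIONED BY (dt)
     LOCATION 'cos://us-geo/twc/Weather/';

     -- Discover the existing dt=... partitions and add them to the Hive Metastore
     ALTER TABLE weather RECOVER PARTITIONS;

     -- Partition pruning: only objects under dt=2020-08-18 are listed and read
     SELECT * FROM weather WHERE dt = '2020-08-18' AND temp > 40;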
  10. Geospatial Analytics in SQL Query
     ▪ Supports geospatial operations and data types
       - According to the SQL/MM standard
       - Aggregation, computation and join via native SQL syntax
     ▪ Geodetic full-earth support
       - Increased developer productivity
     ▪ Avoids piece-wise planar projections
       - High-precision calculations anywhere on earth
       - Very large polygons, e.g. countries, polar caps, etc.
     ▪ Native geohash support
       - Fine granularity
       - Fast spatial aggregation
     ▪ Geospatial data skipping
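A brief sketch of the kind of aggregation this enables via native SQL syntax; the weather table and its columns are illustrative placeholders, but ST_Contains, ST_WKTToSQL and ST_Point are the same geospatial functions used in the example queries later in this deck.

     -- Count observations falling inside a (full-earth) polygon, grouped by day
     -- (table and column names are illustrative placeholders)
     SELECT dt, COUNT(*) AS observations
     FROM weather
     WHERE ST_Contains(
             ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
             ST_Point(long, lat))
     GROUP BY dt
     ORDER BY dt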
  11. Data Skipping in SQL Query
     ▪ Avoid reading irrelevant objects using indexes
       ▪ Complements partition pruning -> object-level pruning
     ▪ Stores aggregate metadata per object to enable skipping decisions
       ▪ Indexes are stored in COS
     ▪ Supports multiple index types (see the sketch below)
       ▪ Currently MinMax, ValueList, BloomFilter, Geospatial
       ▪ Underlying data skipping library is extensible
       ▪ New index types can easily be supported
     ▪ Enables data skipping for queries with UDFs
       ▪ e.g. ST_Contains, ST_Distance, etc.
       ▪ UDFs are mapped to indexes
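The demo slide later in this deck reports a "create indexes" step but does not show its DDL, so the following is only an assumed sketch of defining per-object skipping indexes; the CREATE METAINDEX syntax, index choices and column names here are assumptions, not a confirmed API.

     -- ASSUMED syntax (not shown in this deck): build per-object metadata indexes
     -- so predicates on temp, lat and long can skip irrelevant Parquet objects
     CREATE METAINDEX
       MINMAX FOR temp,
       MINMAX FOR lat,
       MINMAX FOR long
     ON TABLE weather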
  12. How Data Skipping Works
     Spark SQL query execution flow: Query -> Prune partitions -> Read data
     With data skipping, a metadata filter is plugged in as the optional file filter: Query -> Prune partitions -> Metadata filter -> Read data
     Uses the Catalyst optimizer and the session extensions API
  13. Data Skipping Example
     Example query:
     SELECT * FROM cos://us-geo/twc/Weather STORED AS parquet
     WHERE temp > 40
     Data (object listing):
     Weather/dt=2020-08-17/part-00085.parquet
     Weather/dt=2020-08-17/part-00086.parquet
     Weather/dt=2020-08-17/part-00087.parquet
     Weather/dt=2020-08-17/part-00088.parquet
     Weather/dt=2020-08-18/part-00001.parquet
     Weather/dt=2020-08-18/part-00002.parquet
     Metadata (per object):
     Object Name                   Temp Min   Temp Max
     dt=2020-08-17/part-00085      7.97       26.77
     dt=2020-08-17/part-00086      2.45       23.71
     dt=2020-08-17/part-00087      6.46       18.62
     dt=2020-08-17/part-00088      23.67      41.02
     ...
     Objects whose Temp Max is below 40 (shown in red on the slide) are not relevant to this query and can be skipped
  14. Data Skipping Example (continued)
     Same query, data and metadata as the previous slide
     Data layout is important to get good skipping
  15. HoD Data Layout in Production
     gcod/v1/
       hourly/year=2019/month=2/
         20190201T002000Z-part-00000.parquet
         20190201T002000Z-part-00001.parquet
         …
         20190218T232000Z-part-00007.parquet
         20190218T232000Z-part-00008.parquet
       monthly/year=2019/month=1/
         part-00000.parquet
         part-00001.parquet
         …
         part-08191.parquet
  16. HoD Data Layout in Production
     [Maps comparing the Hourly and Monthly layouts; boundaries here are an approximation, not based on actual data]
  17. Geospatial Data Skipping Example
     Example query (Raleigh Research Triangle, US):
     SELECT * FROM Weather STORED AS parquet
     WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
     INTO cos://us-south/results STORED AS parquet
     Metadata (per object):
     Object Name                   lat Min    lat Max
     dt=2020-08-17/part-00085      35.02      36.17
     dt=2020-08-17/part-00086      43.59      44.95
     dt=2020-08-17/part-00087      34.86      40.62
     dt=2020-08-17/part-00088      23.67      25.92
     ...
     The ST_Contains UDF is mapped to necessary conditions on lat and long; objects whose ranges cannot satisfy those conditions (shown in red on the slide) are not relevant to this query
  18. Query Rewrite Approach
     Example query (Raleigh Research Triangle, US):
     SELECT * FROM Weather STORED AS parquet
     WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
     INTO cos://us-south/results STORED AS parquet
     Rewritten query:
     SELECT * FROM Weather STORED AS parquet
     WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
       AND long BETWEEN -79.04 AND -78.67
       AND lat BETWEEN 35.78 AND 36.00
     INTO cos://us-south/results STORED AS parquet
  19. Benefits of Consolidated Metadata
     ▪ The query rewrite approach can leverage MinMax metadata in the Parquet/ORC formats
     ▪ The consolidated metadata approach performs better (3.6x faster)
       ▪ Avoids reading footers
       ▪ Better resource allocation
  20. 10x Acceleration with Data Skipping and Catalog
     • Assumes the query rewrite approach (yellow in the chart) is the baseline
       • Requires Parquet/ORC
     • For other formats the acceleration is much larger
       • e.g. CSV/JSON/Avro
     • Experiment uses the Raleigh Research Triangle query
     • 10x speedup on average
  21. Demo
  22. Demo Stats
     TWC demo dataset properties:
     ▪ 6.404 TB in Parquet format
     ▪ 172,004 objects
     ▪ 36 MB per object (on average)
     ▪ 21 months of weather data
     ▪ 21 partitions
     Performance stats:
     ▪ Create table: 3.8s
     ▪ Recover partitions: 21.9s
     ▪ Create indexes: 12 min 17.0s
       ▪ Data scanned: 5.72 MB
     ▪ Geospatial query: 1 min 14.0s
       ▪ Data scanned: 20.4 MB
     ▪ Catalog: skips 20 of 21 partitions
     ▪ Data skipped: 8186 of 8190 objects
  23. Example Query used by HoD in Production
     SELECT *
     FROM hod_gcod
     WHERE (
         year = 2016 AND 10 <= month
         OR year BETWEEN 2017 AND 2019
         OR year = 2020 AND month <= 3
       )
       AND date_time BETWEEN timestamp("2016-10-15 00:00:00Z")
                         AND timestamp("2020-03-11 00:00:00Z")
       AND ST_Contains(
         ST_Boundingbox(-111.711, 41.081, -109.953, 42.840),
         ST_Point(longitude, latitude)
       )
     INTO cos://us-east/my-results-bucket STORED AS CSV
  24. Query Runtime for HoD in Production Querying a 40x40 gridpoint area (25,000 km2) over time
  25. HoD Sync vs Async
     Querying a 40x40 gridpoint bbox (25,000 km2) to retrieve 1 year of data
                        Synchronous (previous solution)   Asynchronous (new solution)
     Query count        8,000                              1
     Total query time   2h 15m                             3m 20s
  26. HoD Sync vs Async
     Synchronous (previous solution):
     ▪ Limited storage
     ▪ Land only
     ▪ 20 weather properties
     ▪ Query result size limit of 2,400 records
     Asynchronous (new solution):
     ▪ Unlimited storage
     ▪ Global coverage
     ▪ All 34 weather properties
     ▪ Unlimited query result size
     ▪ An order of magnitude reduction in cost
  27. Conclusions
     ▪ Order of magnitude cost reduction
     ▪ Order of magnitude performance improvements
     ▪ Enhanced functionality
     ▪ Key factors:
       ▪ Serverless approach with IBM Cloud SQL Query + COS
       1. Seamless integration with a powerful geospatial library
       2. Fully integrated Catalog
       3. Geospatial data skipping
     ▪ Our data skipping work is extensible
  28. Thanks!
     Contact info: Erik Goepfert (erik.goepfert@ibm.com), Paula Ta-Shma (paula@il.ibm.com)
     Thanks to the team: Ofer Biran, Dat Bui, Linsong Chu, Patrick Dantressangle, Pranita Dewan, Michael Factor, Oshrit Feder, Raghu Ganti, Michael Haide, Holly Hassenzahl, Pete Ihlenfeldt, Guy Khazma, Simon Laws, Gal Lushi, Yosef Moatti, Jeremy Nachman, Daniel Pittner, Mudhakar Srivatsa, Torsten Steinbach
     The research leading to these results has received funding from the European Community’s Horizon 2020 research and innovation program under grant agreement n° 779747.
  29. Feedback
     Don’t forget to rate and review this session
     Take a look at our latest blog