Processing Planetary Sized Datasets
Vehicle Id Trip Id Timestamp Latitude Longitude Altitude
…
10152875766888406 57169639 1445377623000 36.966819 -122.012298 1809
10152875766888406 57169639 1445377625000 36.966845 -122.012248 1809
10152875766888406 57169639 1445377627000 36.966877 -122.012228 1814
10152875766888406 57169639 1445377629000 36.966913 -122.012236 1814
10152875766888406 57169639 1445377630000 36.966946 -122.012236 1814
10152875766888406 57169639 1445377631000 36.966984 -122.012263 1815
10152875766888406 57169639 1445377632000 36.967027 -122.012281 1815
…
39 TB of raw location data:
• 584 billion data points
• 116 million trips
• Many trips per vehicle.
• Want to be able to pull a range of locations by timestamp for trip display.
Vehicle Id Timestamp Latitude Longitude
…
10152875766888406 1445374423000 36.966819 -122.012298
10152875766888406 1445377625000 36.966845 -122.012248
10152875766888406 1445377627000 36.966877 -122.012228
10152875766888406 1445377629000 36.966913 -122.012236
10152875766888406 1445377630000 36.966946 -122.012236
10152875766888406 1445377631000 36.966984 -122.012263
10152875766888406 1445379512000 36.967027 -122.012281
…
This is a challenge with a large dataset:
• A traditional relational database (e.g., Postgres) typically requires hand sharding to scale to PBs of data.
• Highly indexed non-relational solutions (e.g., MongoDB) can be very expensive.
• Lightly indexed solutions (HBase, Cassandra, Azure Table Storage) are a good fit because we really only have one query we need to execute against the data.
PartitionKey (vehicleId) RowKey (timestamp) Latitude Longitude
10152875766888406 1445377623000 36.966819 -122.012298
10152875766888406 1445377625000 36.966845 -122.012248
10152875766888406 1445377627000 36.966877 -122.012228
10152875766888406 1445377629000 36.966913 -122.012236
10152875766888406 1445377630000 36.966946 -122.012236
…
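With this layout, pulling a trip’s locations is a single partition scan over a RowKey range. A minimal sketch of that query using the azure-data-tables Python SDK; the connection string and the timestamp bounds are placeholders, and fixed-width timestamp strings are assumed since RowKey comparisons are lexicographic:

from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string(
    '<connection-string>', table_name='locations')

# All locations for one vehicle in a time window: PartitionKey fixes
# the vehicle, the RowKey range fixes the timestamps.
query = (
    "PartitionKey eq '10152875766888406' "
    "and RowKey ge '1445377623000' "
    "and RowKey lt '1445377632000'"
)
for entity in table.query_entities(query):
    print(entity['Latitude'], entity['Longitude'])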
• Want to query a set of trips in a bounding box.
• Also want to filter activities based on distance and duration.
Trip Id start (sec) finish (sec) distance (m) duration (m) bbox (geometry)
101528 1445377625 1445383025 50023 6222 [-104.990, 39.7392...
101643 1445362577 1445373616 28778 2498 [-122.01228, 36.96…
101843 1445377627 1445382432 4629 701 [0.1278, 51.5074 …
101901 1445362577 1445374713 99691 14232 [139.6917, 35.699...
102102 1445374713 1445374713 25259 6657 [1.3521, 103.8129…
user Id timestamp latitude longitude
10152875766888406 1445377623 36.966819 -122.012298
10152875766888406 1445377625 36.966845 -122.012248
…
10152875766888406 1445383025 36.966913 -122.012236
10152875766888406 1445383030 36.966946 -122.012236
activity id start finish … bbox
101528 1445362577 1445373616 … [-104.990, 39.7392...
101643 1445377625 1445383025 … [-122.01228, 36.96…
101843 1445377627 1445382432 … [0.1278, 51.5074 …
101901 1445362577 1445374713 … [139.6917, 35.699...
102102 1445374713 1445374713 … [1.3521, 103.8129…
Location Data (Azure Table Storage)
Trip Data (Postgres + PostGIS)
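On the trip side of this split, the highly indexed store supports the rich queries above. A minimal sketch of the bounding-box-plus-filters query against Postgres/PostGIS via psycopg2; the table and column names follow the trip table shown earlier but are assumptions, as is the connection string:

import psycopg2

conn = psycopg2.connect('dbname=trips user=app')  # placeholder connection
cur = conn.cursor()

# Trips whose bounding box overlaps a query envelope, filtered on
# distance and duration. ST_MakeEnvelope takes (xmin, ymin, xmax, ymax, srid).
cur.execute(
    '''
    SELECT trip_id, start, finish, distance, duration
    FROM trips
    WHERE bbox && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
      AND distance < %s
      AND duration < %s
    ''',
    (-122.1, 36.9, -121.9, 37.0, 10000, 3600))
for row in cur.fetchall():
    print(row)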
• Total number of location samples in a geographical area.
• Whole dataset operation.
• Divides the world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps 2-dimensional space to 1 dimension (see the tile math sketch below).
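For reference, a minimal sketch of the standard XYZ (slippy-map) tile math behind this addressing. The zoom_row_column ID format is inferred from the examples on the later slides (e.g. 10_398_164), though the exact format the geotile library uses is an assumption:

import math

def tile_id(latitude, longitude, zoom):
    # Standard Web Mercator tile math: project lat/lon onto a
    # 2^zoom by 2^zoom grid and take the integer cell indexes.
    n = 2 ** zoom
    column = int((longitude + 180.0) / 360.0 * n)
    lat_rad = math.radians(latitude)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return '{}_{}_{}'.format(zoom, row, column)

print(tile_id(36.9741, -122.0308, 10))  # -> 10_398_164, matching the slides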
• Can think of it as “Hadoop the Next Generation”.
• Better performance (10-100x).
• Cleaner programming model.
• Used HDInsight Spark (Azure) to avoid the operational difficulties of running our own Spark cluster.
For each location, map to tiles at every zoom level:
(36.9741, -122.0308) → [
  (10_398_164, 1), (11_797_329, 1),
  (12_1594_659, 1), (13_3189_1319, 1),
  (14_6378_2638, 1), (15_12757_5276, 1),
  (16_25514_10552, 1), (17_51028_21105, 1),
  (18_102057_42211, 1)
]
Reduce all these mappings with the same key into an aggregate value:
(10_398_164, 151) ← [
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1)
]
def tile_id_mapper(location):
    # Map one location to a (tileId, 1) pair for every zoom level we
    # want in the heatmap. Tile.tile_ids_for_zoom_levels comes from
    # the geotile library linked at the end of the deck.
    tileMappings = []
    tileIds = Tile.tile_ids_for_zoom_levels(
        location['latitude'],
        location['longitude'],
        MIN_ZOOM_LEVEL,
        MAX_ZOOM_LEVEL
    )
    for tileId in tileIds:
        tileMappings.append((tileId, 1))
    return tileMappings
Building the heatmap then boils down to this in Spark:

lines = sc.textFile('wasb://locations@loc.blob.core.windows.net/')
locations = lines.flatMap(json_loader)
heatmap = (locations
    .flatMap(tile_id_mapper)
    .reduceByKey(lambda agg1, agg2: agg1 + agg2))
heatmap.saveAsTextFile('wasb://heatmap@loc.blob.core.windows.net/')
…
Geolocated Tweets → Summary Updates
Tile index example (tile IDs across zoom levels 6-9): 6_25_31; 7_50_62, 7_50_63, 7_50_64; 8_100_124, 8_100_125, 8_100_126; 9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250
JSON: 39 TB → Avro: 23 TB (60% smaller)
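To make the Avro comparison concrete, a minimal sketch using the fastavro Python library; the schema fields mirror the location records shown earlier, but the exact schema the project used is an assumption:

from fastavro import parse_schema, writer

# Assumed record schema mirroring the location rows above.
schema = parse_schema({
    'name': 'Location',
    'type': 'record',
    'fields': [
        {'name': 'vehicleId', 'type': 'string'},
        {'name': 'timestamp', 'type': 'long'},
        {'name': 'latitude', 'type': 'double'},
        {'name': 'longitude', 'type': 'double'},
    ],
})

records = [{'vehicleId': '10152875766888406', 'timestamp': 1445377623000,
            'latitude': 36.966819, 'longitude': -122.012298}]

# Schema-ed binary output is substantially smaller than equivalent JSON.
with open('locations.avro', 'wb') as out:
    writer(out, schema, records)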
• geotile (http://github.com/timfpark/geotile): XYZ tile math in C#, JavaScript, and Python
• heatmap (http://github.com/timfpark/heatmap): Spark code for building heatmaps
• tileIndex (http://github.com/timfpark/tileIndexPusher): Azure Function for pushing tile indexes
Tim Park
@timpark
Editor's Notes
  1. Thanks… I thought I would start today by briefly looking at where we’ve been as an industry, and how that has shaped our practice of building software
  2. So, in the beginning there was a computer. It filled a room and was something that only a government had access to. It was very sexily called the ENIAC and it did mathematical computations at the blazing speed of 360 calculations per second. Required an army of people in suits to keep it operating. It was also central to the scientific understanding of how far neutrons could penetrate matter before hitting a nucleus and how much energy it would give off when it did. To calculate this, the engineers on the project invented the Monte Carlo Method, which basically uses random samples of a problem to arrive at a converged numerical result. We still use this technique today in software profiling.
  3. Over a decade or so, computers got smaller and only took up half a room. They became accessible in price to universities and large corporations. 1000x faster than the first computer. 1000x more access by developers to computers. This led to many more geniuses at the keyboard (like Kernighan and Ritchie here). Their desire to be more productive led to an explosion of useful software and techniques that we still use today: the first modern systems language in C, and modern operating systems like UNIX. We still model many of our programming languages and operating systems off of this work.
  4. All of these techniques trickled down into the original PCs like the IBM PC and the Macintosh. These computers put computing in the hands of ordinary people, workers, students, and hackers. This led to the first widespread usage of computers by non-engineers and to the reinvention of a wide range of tasks.
  5. Writers moved from typewriters to…
  6. … To word processing …
  7. Accountants and inventory management moved from error prone paper…
  8. …To spreadsheets and enterprise resource management systems. The big point here is that the personal computer enabled the automation of a wide range of office work tasks that were largely abstract. And we as developers invented things like filesystems to manage files and B-trees to enable fast queries against databases. And this is the big meta point: at every point in the evolution of the computing industry, we have had to investigate and discover what have become widely applicable approaches to building applications based on solving practical real world problems.
  9. The mobile phone has obviously taken this technology progression forward another step.
  10. While many people would point to app stores as the most interesting way mobile has changed the software industry, in my opinion the most interesting thing it has done is expand computing out to the real world... enabling whole classes of new applications...
  11. Like having a map of the world in our pockets…
  12. …being able to push a button and have a car show up 5 meters away…
  13. And, of course, the all-important Pokemon Go, which allows us to catch virtual animals…
  14. In the same way that our predecessors had to figure out the best way to process text and structured data, the explosion of mobile, and the fact that many mobile apps revolve around the real 3D world, means that processing and analyzing geospatial data is increasingly woven into the applications that we are building as developers. I work in our developer advocacy team at Microsoft and I really have an awesome job where I get to work with a bunch of customers and partners on their hardest technical challenges. Today I’m going to share some of the things we have learned in the course of those projects around effectively processing geospatial data in the cloud. The first project I wanted to share with you, and the one I’ll use to largely ground this talk, is around a transportation partner that we worked with here in Europe.
  15. This well known transportation company collects location traces for each of the trips that their fleet takes. Here is a visualization of what one of those trips looks like from a prototype we built with them.
  16. The shape of this dataset will probably not surprise you. It’s basically a whole lot of data that includes: a vehicle id, a bunch of sets of trips that are identified by a trip id, and a bunch of timestamped location data: latitude, longitude, and altitude. And so, there are many many vehicles, who each have many trips, which all have many timestamped locations.
  17. It’s probably not hard for you to imagine that this company ends up with literally a mountain of data. This is pretty common in this space. Collecting geospatial data often results in a ton of data landing at your doorstep.
  18. And to give you an idea of this, we worked with them on a month’s worth of data. Even this small time slice of data is still pretty large. A CSV dump of a month, containing 584B locations across 116M trips, is over 39TB in size. The point here is that this is definitely larger than any typical computing node, and therefore we have to use larger scale data techniques in order to store and process it.
  19. Let’s first start by talking about how we store the location data. As we talked about previously, a trip has an ordered set of locations by timestamp. To display a trip, we need to pull all of the associated locations for that trip.
  20. So what we need from the storage system is the ability to pull a range of timestamps. With this, we can pull all of the locations for a particular trip.
  21. This is pretty standard stuff for a database but becomes a challenge with a large dataset (reasons above)
  22. This brings us to the first pattern that we have used for all of our geospatial projects. And that’s to use a lightly structured storage system for this sort of data. In our projects, we, naturally, almost exclusively use Azure. And for this sort of storage requirement we use Azure Table Storage. For those of you that aren’t familiar with Azure Table Storage, it sits somewhere just above blob storage. And as such, it is very inexpensive, and costs roughly 2 US cents per GB per month. And unlike blob storage, you can access ranges and individual rows in the data. But you can only query on a set of RowKeys within the same PartitionKey. But we only need to query on timestamp. And it satisfies our need to be able to query a range of user locations by timestamp range.
  23. Now that we have an approach for storing locations, let’s look at how we store the trip metadata itself. One of the key queries we want to do is around querying the trips in a bounding box. We also, in the future, want to be able to filter trips on distance or duration.
  24. What this essentially means is that unlike the location data, we want the trip data to be highly indexed. We want to be able to do queries like “Give me all of the trips under 10km in length near Dublin” or “Give me all of the trips over 1 hour in duration near Berlin”. And this means we need the data columns highlighted above to be indexed, which is not a great fit for the Azure Table Storage we saw in the first pattern. But the good news is that there is 10 to 20 thousand times less data as well, which brings it within range of a traditional relational database like MySQL or Postgres. And, so, we can just set up some schema’ed tables which allow us to make rich queries against the data.
  25. Which brings us to the second pattern for dealing with high scale data like this: use the best storage system for each scenario that solves a particular application’s needs. We call this “polyglot persistence”. And what we mean by that is that you should choose each storage system because it excels at a portion of the problem that you are trying to solve. As we saw before, I am using Azure Table Storage for the location data, and for the trip data, I am using PostgreSQL + PostGIS. The way this works is that we query for trips using Postgres and then, when we need the location data, we query for it from Table Storage. Ok, so that is how we are storing and querying trips. We load locations into Table Storage and trips into Postgres at creation and then query them.
  26. As we saw, the data from these vehicle trips is basically a whole lot of CSV files with location information that includes: user id, activity id (which identifies all of the data that is part of the same run, walk, hike, bike ride, etc.), timestamp, latitude and longitude. There are many many users, who each have many activities, which all have many timestamped locations.
  27. So what does the shape of this dataset look like? It’s basically a whole lot of CSV files with location information that includes: a user id, activity id, timestamp, latitude and longitude. There are many many users, who each have many activities, which all have many timestamped locations.
  28. Now that we have talked about how we store a dataset of this size, let’s dig in and talk more about some of the techniques we can use to process it at scale. The transportation company we worked with also wanted to be able to visualize a heatmap of where their fleet spends its time.
  29. This is a pretty common problem. We also tackled this with Stroeer, an outdoor advertiser in Germany. They are the company that does most of the advertising billboards you see on the sidewalks in urban centers throughout Germany. One of the important factors in advertising is that the overall mood of a particular place can make a particular ad much more or less effective. Given this, Stroeer combined a number of datasets, including geotagged social feeds, to come up with an overall estimate of what people were feeling, and then used this plus demographic data to decide which ad to show and how much to charge for each ad. So, building heatmaps like these is a pretty common problem…
  30. Let’s look at how we generated the heatmap for our transportation company, since it is a simpler scenario. In that case, the heatmap is generated by summing up the number of location samples in a particular geographical area. So for every location that a truck sends back, we attribute that location to a summary in the heatmap, and sum up over all of the locations that are associated with that summary.
  31. So before we dive into how we implemented this, let’s discuss how we map a particular location point to a geographic summary. One of the common patterns we’ve seen is using the XYZ tile as a summarization bucket. This is not something we’ve invented. Open Street Maps, Google, and Apple use the same concept for addressing geographical areas in their maps. It is a fairly simple concept. The top level world is divided into 4 tiles. For each zoom level below that, you take the parent tile and recursively divide it into 4 tiles. An individual tile is then addressed by its zoom level and its row and column within that zoom level. And so, in order to build these heatmap summaries, we basically count location samples in these geographic XYZ tiles. This means that it is operating over the whole dataset, and given the large size of the dataset we need to open our big data toolbox to accomplish this.
  32. To accomplish this, we used Apache Spark. For those of you that have used Hadoop, Apache Spark is sort of a “Hadoop the next generation”. It operates on data in a similar paradigm but offers much better performance and, in my opinion, a much nicer programming model than the original Hadoop engine. HDInsight Spark is Azure’s hosted version of Apache Spark. We used this hosted offering so we could ignore the significant operational work of running a Spark cluster and focus on the actual problem we are trying to solve.
  33. In order to compute our heatmap, we use a pretty standard map/reduce algorithm. For every location in the dataset, we generate a tile id key/value pair for every zoom level that we want results for. In this case, we are generating tile key/value pairs for the zoom levels from 10 to 18 because we knew that these are the only set of tiles that the user interface would end up using. So for instance, for this location from a vehicle, we generated one for zoom level 10, zoom level 11, … through zoom level 18.
  34. We then reduce all of these mappings with the same key down into its aggregate heatmap value. In Spark, the first element in a tuple is considered the key and the second element is considered the value. Which is to say, we take all of the locations with the same tile ids from the previous step and count them.
  35. So let’s look at what an implementation of these concepts looks like in Python. We first compute the tileIds for all of the zoom levels that we want to collect results for, and then use that to build tuples for each tileId.
  36. With this mapper, the overall implementation in Spark is fairly straightforward. We point it at the dataset in blob storage, then parse it as JSON, and then use the tile_id_mapper function we defined earlier to map each location to the appropriate zoom level result. We then reduce all of these individual results by key to get a final total for each tile result. We implement the reducer as an anonymous lambda function that essentially sums the intermediate aggregates for tiles with the same id. And then write the heatmaps back out to blob storage.
  37. From a programming standpoint, Spark makes this look really easy. But under the covers, Spark is doing a lot of work for us. Remember, we are working on a dataset that doesn’t fit onto a single machine. During the map stage, any tile id could be generated by any of the mappers in the Spark cluster since locations are uniformly distributed in the dataset. This means that there needs to be a shuffle step in which the results for a given tile id are assigned to the same reducer so that we can calculate an aggregate value, and therefore there will be potentially billions of these tuples floating around across the cluster. The good news is that Spark handles all of this underneath the covers for us.
  38. The next pattern I wanted to talk about is incremental ingestion. In the real world, we don't usually have static data but instead data that is constantly arriving. In this case, we have vehicles that are constantly delivering data. And ideally, to make Spark more efficient, we'd like to combine these incoming small trips into a set of large aggregate files. For this we are using Azure Event Hub. You can think of Event Hub as a giant cloud buffer where a downstream backend system controls the rate at which data is read out. It is helpful in situations where you have 1) bursty data and 2) data that isn't needed immediately. In this case, we want to use it to buffer up the data for a particular hour time slice, and then create hourly summaries of this new data using another Azure service called Stream Analytics. Using these pieces of infrastructure in conjunction with each other to enable incremental ingestion of data is a key pattern for high scale location data.
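As a rough illustration of the buffering half of this pattern, a minimal sketch of publishing location events to Event Hub with the azure-eventhub Python SDK; the connection string, hub name, and payload shape are placeholders:

import json
from azure.eventhub import EventData, EventHubProducerClient

# Placeholder connection string and event hub name.
producer = EventHubProducerClient.from_connection_string(
    '<connection-string>', eventhub_name='locations')

# Buffer a batch of samples into the hub; the downstream slicing job
# drains them into hourly aggregates at its own pace.
batch = producer.create_batch()
batch.add(EventData(json.dumps({
    'vehicleId': '10152875766888406',
    'timestamp': 1445377623000,
    'latitude': 36.966819,
    'longitude': -122.012298,
})))
producer.send_batch(batch)
producer.close()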
  39. Incremental ingestion leads us to our next pattern: processing data in slices. We do this in a manner analogous to how we processed the whole dataset in the previous slides, but we only operate on the single new data slice. Since we are only processing an individual slice, we do not need nearly as large a cluster to do the processing. We then load in the previous complete result, fold in this newly computed partial heatmap, and then write out the new heatmap. Although this adds a second step, overall we are operating on a much smaller set of data, and therefore it is much more efficient.
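A minimal PySpark sketch of that fold-in step, assuming both the previous result and the slice's partial heatmap are stored as (tileId, count) pairs and that parse_tile_count is a small assumed helper that turns the saved text lines back into tuples:

# Load the previous complete heatmap and the newly computed slice.
previous = (sc.textFile('wasb://heatmap@loc.blob.core.windows.net/')
    .map(parse_tile_count))
slice_update = (sc.textFile('wasb://slice@loc.blob.core.windows.net/')
    .map(parse_tile_count))

# Folding the slice in is just another reduceByKey over the union.
updated = previous.union(slice_update).reduceByKey(lambda a, b: a + b)
updated.saveAsTextFile('wasb://heatmap-updated@loc.blob.core.windows.net/')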
  40. The next project I wanted to share with you is some work that we’ve been doing with the United Nations Office for the Coordination of Humanitarian Affairs. This part of the UN is tasked with getting help to the most vulnerable people in the midst of humanitarian crises as quickly as possible.
  41. We worked with them to see how technology could have helped them during the many humanitarian crises that have happened in the Libyan Civil War that has been ongoing since 2014.
  42. Armed conflicts and natural disasters make up nearly equal parts of their remit. These events yield devastated civilian infrastructure…
  43. …and many vulnerable refugees seeking shelter and food.
  44. The state of the art of detecting these sorts of crises is still very human driven. For example, a photojournalist may be documenting the event
  45. And might call back with his observations of where and what help is needed most
  46. The UN then kicks into gear and coordinates a disaster relief effort. But this is a very slow approach – it can be days before critical needs are discovered
  47. And sadly, that is several days too late for some of the victims of these crises
  48. Fortunately, the UN had an idea for how to improve on this, and it leverages the high penetration of mobile in these developing countries.
  49. The idea was pretty simple: could we search for humanitarian keywords in geolocated tweets and other short messages, intersecting and summarizing them against real world features like a city or a state, to build a near real time dashboard that detects more quickly where these crises are occurring?
  50. Geographical features in the real world have complex shapes. This is probably not a surprise to you. The challenge, given this, is how to do these intersections at scale. In the United Nations project we utilized the Open Street Maps dataset to compute intersections using a two stage process. We again employed the XYZ tile to accomplish this. We loaded and keyed each geolocated tweet. These are the small squares in the slide. We then loaded each Open Street Map feature and keyed it with the tileId at the maximum zoom level that encompasses the feature. In this case, the black square represents the smallest XYZ tile that spans the Benghazi region completely. In Spark, when you have two datasets of tuples keyed the same way, you can use a join operation to select the elements that exist in both datasets. We use this to join the geolocated tweets (the small rectangles) against the features. This yields a set of candidate feature matches. It is not the final set of matches because the black box that represents the spanning tileId for Benghazi obviously does not fit it perfectly. This means that false matches like the red square are included in the candidate matches. That said, the join does narrow down the set of potential matches considerably, including the blue square below it. This narrowed set of potential matches allows us to make a second pass, where we do a fine grained intersection test against the real border data. So this is how we can do a scaled intersection of the dataset to come up with the set of features each datapoint intersects with, so we can aggregate against them. However, as I mentioned previously, the United Nations wanted a system that could react in near real time to humanitarian crises. There are a couple of ways that you can do this. You could do it with Spark Streaming and micro batches, but that can be an expensive way to accomplish this. Maintaining an entire cluster to handle the worst case load means that in the common case a number of the nodes of the cluster will go underutilized, so it only really makes sense when you have a pretty consistent volume of data and you can size the cluster appropriately.
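A rough PySpark sketch of this two-stage intersection; the keying of both datasets and the use of shapely for the fine-grained pass are assumptions for illustration:

from shapely.geometry import Point, shape

# Assumed inputs:
#   tweets:   RDD of (tileId, tweet), each tweet keyed by the tile that
#             contains it at the relevant zoom level
#   features: RDD of (tileId, feature), each feature keyed by the
#             smallest tile that completely spans its geometry
candidates = tweets.join(features)  # coarse tile-level matches

def truly_intersects(pair):
    # Second pass: exact point-in-polygon test against the real border.
    tweet, feature = pair[1]
    point = Point(tweet['longitude'], tweet['latitude'])
    return shape(feature['geometry']).contains(point)

matches = candidates.filter(truly_intersects)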
  51. Using an entirely batch infrastructure for this can be slow and expensive. So instead, we adopted a Lambda Architecture. This is becoming a pretty common pattern in our industry. A lambda architecture pairs a batch layer that does full dataset processing with a speed layer that does near real time stream level processing to keep these results up to date. Projects use this architecture because it enables you to do a full recompute on the dataset to handle, for example, new feature additions. In between those full recomputes, you can use a less expensive mechanism to update the overall view with the latest data.
  52. For the batch layer, we are using, like the previous examples, Apache Spark. This is the architectural diagram that implements the algorithm I described in the previous slide. From an architectural perspective it is very straightforward: we take in the geolocated tweets and the features, intersect them with a join, aggregate the values over the features, and then do a bulk update of the results.
  53. We've talked about what the batch layer for this looks like. The speed processing layer we implemented builds off of the incremental ingestion processing that we described previously. We used a new service in Azure called Azure Functions. Azure Functions enables you to process data as it arrives by applying a function to each new piece of data. In our case, for each tweet that comes in, we find the geographic features it intersects and update the existing dataset. This enables a nearly real time update of the dashboard, but if we add features that we want to aggregate over, we can still rerun the batch layer to get summaries over those previous results.
  54. We efficiently implemented this feature intersection service in Azure using a platform called Azure Functions. Azure Functions is a serverless platform, which means that instead of dimensioning the platform in terms of the number of instances that are constantly running, you instead provide a function that should be executed every time a particular event happens, and the infrastructure handles automatically scaling the number of instances to match the incoming event rate. This slide shows a very simple example of what one of these functions looks like. I wrote it in node.js, but you can write these functions in a variety of languages including C#. Basically, the function receives the tweet data slice that the incremental processing pipeline we previously discussed has dropped into blob storage, breaks it down into individual tweets since the blob contains one per line, and calls an underlying service to aggregate them on an hourly basis. It then stores these such that the frontend we have consuming them can use them. (old) The great news is that Azure has a nice new feature called Functions that does just this. Azure Functions allows you to set up a trigger on a wide variety of events, including queued events, http requests, service bus, timers, etc. It builds on top of our Azure App Service’s Web Jobs functionality to provide a super easy interface. When one of these triggers fires, a function that you have written is executed. In this case, I wrote my function in JavaScript, but C# and a host of other languages are also supported. I also set up this Azure Function to trigger on a new blob being added to a storage account container. So each time a blob is created by the previous incremental ingestion pattern, it is passed into this function so that it can be enriched with elevation data. Note that the whole blob is passed into the function, so you need to make sure that the blobs for your scenario will fit in memory. Once the locations in the blob have been enriched, we then push these out to another blob, which we use downstream.
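The function itself isn't reproduced on the slides (the talk's version was written in Node.js); as a stand-in, here is a minimal blob-triggered Azure Function sketch in Python, where the binding names and the aggregate_by_feature helper are hypothetical:

import json
import azure.functions as func

def main(tweetblob: func.InputStream, outblob: func.Out[str]):
    # Fires when the incremental pipeline drops a new slice into blob
    # storage; the blob holds one JSON tweet per line.
    tweets = [json.loads(line)
              for line in tweetblob.read().decode('utf-8').splitlines()
              if line]

    # Hypothetical helper: find the features each tweet intersects and
    # sum counts per feature for this hourly slice.
    summaries = aggregate_by_feature(tweets)

    # Write updated summaries where the dashboard frontend reads them.
    outblob.set(json.dumps(summaries))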
  55. I wanted to end today by talking about how we present and display this information. Many of you might have suspected it when you saw this the first time, but there is a fairly large set of data that is sent with each of these heatmap queries. If we made a traditional bounding box database query against a geospatial relational store for each of these heatmaps, you’d likely end up with a solution that didn’t scale well or scaled pretty expensively.
  56. Instead of querying for the heatmaps, we instead precomputed the heatmap elements that should be displayed within a particular view. We use a lower zoom level block, shown here, as a container for the higher zoom level summaries, and then precompute what should be displayed. We store all of these resultsets in blob storage, which is very inexpensive compared to the number of VMs/databases that you’d otherwise need to do this. This allows us to turn a querying problem into a “sending json” problem for our web frontends, and also allows us to cache each of these resultsets in the browser.
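A minimal sketch of pushing one precomputed resultset to blob storage with the azure-storage-blob Python SDK; the container layout and naming-by-tile-ID scheme are assumptions:

import json
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; one JSON blob per container tile.
service = BlobServiceClient.from_connection_string('<connection-string>')
blob = service.get_blob_client(container='heatmaps', blob='10_398_164.json')

# Precomputed summaries for the higher zoom levels inside this tile.
resultset = {'tileId': '10_398_164', 'summaries': [['14_6378_2638', 151]]}
blob.upload_blob(json.dumps(resultset), overwrite=True)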
  57. Architecturally, this looks like this. We use the slice architecture that I talked about and heatmap deltas to determine which heatmaps require updates. We then push new heatmap resultsets out to blob storage for each of these. And that’s it – we are trading off using more storage for precomputed views vs. using computation to generate them on the fly – and this is something that you should in general look for in your projects.
  58. One other thing that I glossed over when we described displaying trips: as you remember, the application displays an elevation graph for each trip.
  59. But, also remember, our input data set does not have elevation as part of it.
  60. We’ve worked with Guide Dogs for the Blind to build a device that uses data from Open Street Maps The app helps blind people: * Discover where they are * What's around them * And helps navigate them to locations
  61. That said, while storage is becoming increasingly cheap, and while JSON is a fantastic format and very developer friendly, the traditional laws still apply: reading and writing data incurs latency. So when you go to do a project like this for real, consider a binary data serialization format like Avro instead. By establishing a schema, and using a binary serialization format, you can achieve on the order of a 60% size reduction. This improves the performance of deserializing and serializing the data from and to blob storage. It also, naturally, linearly cuts your data storage costs.
  62. Ok, that’s what I have for you today. I’ve open sourced a couple of things as part of this presentation that you should have a look at if you are interested in more details: geotile, heatmap, and tileIndex (an Azure Function for pushing tile indexes; includes a sample dataset that you can work against).
  63. I will share these slides via Twitter, so follow me @timpark if you’d like to get a copy. And with that, thank you for coming out today for the talk, and I’d be happy to take any questions…