GraphReduce is a solution for feature engineering on graph-structured enterprise data for machine learning. It represents tables as nodes in a graph and foreign keys as edges in order to flatten large datasets. It defines abstractions such as cut dates and consideration periods to orient data in time. Nodes can be parameterized for primary keys, dates, file formats, and compute functions. This allows rapid development, testing, and deployment of feature pipelines across many tables for machine learning models. It was successfully used by FreightVerify to build daily-updated supply chain monitoring models from billions of events across more than 20 tables.
3. But ML likes vectors not graphs...
ML needs this: [1, 6, 33.3, 'product 5', 'opened notification']
Not this: [figure: a graph of interrelated tables connected by foreign keys]
4. Problem
● Prior to training machine learning models, we need to generate features.
● ML models need vectors of data, not fragmented tables scattered across the enterprise database.
● Vectorizing many tables requires table joins, aggregate functions, etc.
● As the number of features grows, so does the likelihood of one-off boilerplate code (e.g., joins, group bys).
● Additionally, many features may share the same tables.
● Without a reusable, composable interface for building and automating features, technical debt and system complexity increase.
● Note: feature stores solve some of these problems, but they take a different route.
5. Why does feature engineering complexity matter?
● The failure rate of AI projects is high (85%), so experiment speed matters.
○ https://www.gartner.com/en/newsroom/press-releases/2018-02-13-gartner-says-nearly-half-of-cios-are-planning-to-deploy-artificial-intelligence
● The cost of AI projects is high, so reusability, extensibility, and production readiness are of high importance.
○ https://www.phdata.io/blog/what-is-the-cost-to-deploy-and-maintain-a-machine-learning-model/
○ Bare bones, without MLOps: $60K
○ With MLOps for 1 model: $95K
● The talent shortage exacerbates all of the above.
○ https://www.forbes.com/sites/forbestechcouncil/2022/10/11/the-data-science-talent-gap-why-it-exists-and-what-businesses-can-do-about-it/?sh=3c63f6f23982
● Summary: If you don't care, your boss does. If they don't care, their boss does.
6. How do we vectorize the customer data graph?
● Customer
○ N = 100,000
● Orders
○ 2N = 200,000
● Order Events
○ 6N = 600,000
● Order Products
○ 10N = 1,000,000
● Notifications
○ 1000N = 100,000,000
● Notification Interactions
○ N^2 = 10,000,000,000
7. How do I update, modify, and maintain this?

-- hand-written feature query: notification and order aggregates per customer
select c.id as customer_id,
       nots.num_notifications,
       nots.total_interactions,
       nots.avg_interactions,
       nots.max_interactions,
       nots.min_interactions
from customers c
left join (
    -- per-customer notification aggregates
    select n.customer_id,
           count(n.id) as num_notifications,
           sum(ni.num_interactions) as total_interactions,
           avg(ni.num_interactions) as avg_interactions,
           max(ni.num_interactions) as max_interactions,
           min(ni.num_interactions) as min_interactions
    from notifications n
    left join (
        select notification_id,
               count(id) as num_interactions
        from notification_interactions
        group by notification_id
    ) ni on n.id = ni.notification_id
    group by n.customer_id
) nots on c.id = nots.customer_id
left join (
    -- per-order event and product aggregates
    select o.id as order_id,
           o.customer_id,
           oe.num_order_events,
           oe.num_type_events
    from orders o
    left join (
        select order_id,
               count(id) as num_order_events,
               sum(case when event_type_id = 1 then 1 else 0 end) as num_type_events
        from order_events
        group by order_id
    ) oe on o.id = oe.order_id
    left join (
        select order_id,
               count(id) as num_order_products,
               sum(case when product_type_id = 5 then 1 else 0 end) as num_expensive_products,
               sum(product_price) as product_price_sum,
               max(product_price) - min(product_price) as product_price_range
        from order_products
        group by order_id
    ) op on o.id = op.order_id
) ods on c.id = ods.customer_id
where c.is_high_value = 1
  and c.is_test = 0
  and c.some_other_filter = 'yes';
8. What about orientation in time?
SQL: WHERE some_col >= 'YYYY-MM-DD' AND some_col < 'YYYY-MM-DD'
…
…
select order_id,
       count(id) as num_order_products,
       sum(case when product_type_id = 5 then 1 else 0 end) as num_expensive_products,
       sum(product_price) as product_price_sum,
       max(product_price) - min(product_price) as product_price_range
from order_products
where ts > '2023-01-01' and ts < '2023-05-01'

Python: df[(df[col] >= 'YYYY-MM-DD') & (df[col] < 'YYYY-MM-DD')]
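To make cut dates and consideration periods concrete, here is a minimal pandas sketch of the same filtering pattern; the function, variable, and column names are illustrative, not from the deck:

import pandas as pd

def filter_consideration_period(df, date_col, cut_date, period_days):
    # keep rows inside the consideration period that ends at the cut date
    cut = pd.Timestamp(cut_date)
    start = cut - pd.Timedelta(days=period_days)
    ts = pd.to_datetime(df[date_col])
    return df[(ts >= start) & (ts < cut)]

# e.g., features from the 120 days leading up to a 2023-05-01 cut date:
# window = filter_consideration_period(order_products, 'ts', '2023-05-01', 120)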
9. Solution
● Needs to work for tabular data.
● Batch ML training; this does not solve the online case. If you need online feature engineering, consider feature stores.
● Need the following:
○ Reusable and scalable to many tables
○ Composable interface for switching the tables used and the feature vector computed
○ Orientation in time
○ Abstractions for repetitively implemented logic, such as joins, group bys, and filters
○ Ability to support multiple feature definitions for the same table
○ Production-ready interface, with no changes between experimentation and production MLOps
○ Ability to plug into multiple compute backends and extend easily to new backends
● Must flatten arbitrarily large enterprise data graphs.
10. Solution 2
● Graphs can serve as the data structure for this problem by representing tables as nodes and foreign keys as edges (see the sketch below).
● By leveraging graph data structures, we can plug into existing open source:
○ https://github.com/networkx
○ https://github.com/WestHealth/pyvis
● Some other companies have taken this approach with GNNs:
○ https://kumo.ai
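As a rough illustration of the idea (table names follow the earlier customer example; this is plain networkx, not GraphReduce code), the schema itself can be modeled as a directed graph:

import networkx as nx

# tables as nodes, foreign keys as directed edges (child -> parent),
# with edge metadata for the join keys and cardinality
g = nx.DiGraph()
g.add_edge('orders', 'customers', fk='customer_id', pk='id', cardinality='many:1')
g.add_edge('order_events', 'orders', fk='order_id', pk='id', cardinality='many:1')
g.add_edge('order_products', 'orders', fk='order_id', pk='id', cardinality='many:1')
g.add_edge('notifications', 'customers', fk='customer_id', pk='id', cardinality='many:1')
g.add_edge('notification_interactions', 'notifications', fk='notification_id', pk='id', cardinality='many:1')

# a topological sort visits children before parents, which is a valid
# reduce order: aggregate each child table, then join upward
print(list(nx.topological_sort(g)))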
12. GraphReduce
● GraphReduce
○ Top-level class that subclasses nx.DiGraph and defines abstractions for:
■ Cut date: the date around which to orient the data
■ Consideration period: the window of time to consider relative to the cut date
■ Compute layer: the compute backend to use (e.g., Spark)
■ Enforcing naming conventions and operation sequence
■ Edges between nodes and edge metadata (e.g., cardinality between nodes)
■ Compute graph specifications, such as whether or not to reduce a node
● GraphReduceNode
○ Custom class for each node, which allows parameterization of the following (see the sketch after this list):
■ Primary key
■ Date key
■ File path
■ File format
■ Compute layer
■ Prefix
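To show how these parameters might fit together, here is a hypothetical pandas-based sketch of a node for the notifications table; the class shape, method names, and signatures are assumptions for illustration, not GraphReduce's actual API:

import pandas as pd

class NotificationsNode:
    # hypothetical node wrapping the notifications table; a real
    # GraphReduceNode subclass may differ in names and structure
    def __init__(self, fpath, fmt='parquet', pk='id',
                 date_key='ts', prefix='not'):
        self.fpath = fpath          # file path
        self.fmt = fmt              # file format
        self.pk = pk                # primary key
        self.date_key = date_key    # date key used for time filtering
        self.prefix = prefix        # column prefix to avoid join collisions

    def get_data(self):
        # compute layer here is pandas; Spark or Dask could be swapped in
        df = pd.read_parquet(self.fpath)
        return df.add_prefix(self.prefix + '_')

    def reduce(self, df, reduce_key='customer_id'):
        # aggregate to the grain of the parent node (one row per customer)
        key = self.prefix + '_' + reduce_key
        return (df.groupby(key)
                  .agg(num_notifications=(self.prefix + '_' + self.pk, 'count'))
                  .reset_index())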
14. Case Study: FreightVerify https://freightverify.com
● Customer:
○ FreightVerify, an automotive supply chain monitoring SaaS, with over 50 million shipments tracked and tens of billions of events received from carriers. The customer receives billions of coordinate updates per year and produces billions of ETAs (estimated times of arrival) for their customers' supply chains.
● Problem:
○ Current models were more than 3 months stale, and data sizes had outgrown the existing technology. The goal: build a machine learning operations solution with feature engineering pipelines for all current and future ETA model architectures, extensible enough for model architectures beyond ETA.
● Solution:
○ After digesting the customer's data layer, built a Spark-based feature engineering solution with a graph architecture, which abstracted most map/reduce operations, joins, filters, and annotations for feature engineering across more than 20 tables.
● Results:
○ Enabled rapid build, test, deployment, and product integration of over 50 models. Time to market for new models was drastically reduced, model performance increased, and operational complexity fell. The customer can now rebuild up-to-date models daily to react quickly to changing global supply chain conditions.
15. Next steps
● Reducing the amount of boilerplate code required
● Supporting automated feature engineering on undefined nodes
● Dynamic upward propagation of aggregated features
● Potential integration with fugue: https://github.com/fugue-project/fugue
● Enhancements to visualization, graph serialization, and tracking
● Integration with other projects