DataEngConf SF16 - Data Asserts: Defensive Data Science

1) Complex data pipelines can introduce bugs that compound as dependencies increase. Engineers manage complexity through encapsulation, clear APIs, and integration tests. 2) Data scientists require semantic correctness but making assumptions introduces risks. Sanity checks on fields like verifying formats and constraints help identify potential errors. 3) Defensive data science through data asserts maintains quality by clearly defining trust boundaries and assumptions. Checks should match expectations and be revisited regularly as upstream changes can impact pipelines.

Data Asserts
Defensive Data Science
Tommy Guy
Microsoft

Our pipeline:
DATA!!!
Insight! Direction! Strategy!

Our pipeline in reality: bugs tend to compound
DATA!!!

How do Engineers Manage Complexity?
Encapsulate: create functions/classes/subsystems
with clear APIs. This helps isolate complexity
Integration Tests: ensure that the components interact
correctly. This helps identify breaking changes.

Data introduces a few complications
Pipelines take many upstream dependencies
Researcher use cases are frequently unknown and
unanticipated by data providers.
Pushing requirements upstream to all producers is
Sisyphean.

We are not talking about data pipeline tests
The data pipeline teams ask:
Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
Are availability SLAs being met?
• Progressive server-client merging

Data Scientists Require Semantic Correctness
Does this field mean what I think it does?

How do Data Scientists identify potential
errors?

Some follow-on fact is absurd…
… which leads to investigation …
… which finds a broader problem
If [potential conclusion], then we must have 3 billion
OneDrive users…
… because my user table doesn’t have a primary key …
… so I should aggregate by user.
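
The scenario above (an absurd user count traced back to a table without a primary key) can be caught with a uniqueness check before any conclusion is drawn. A minimal sketch in plain Python, assuming hypothetical rows with `user_id` and `uploads` fields; neither name comes from the talk:

```python
from collections import Counter, defaultdict

def duplicate_keys(rows, key):
    """Return the key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return {value for value, n in counts.items() if n > 1}

def aggregate_by_key(rows, key, field):
    """Collapse rows to one total per key value by summing `field`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[field]
    return dict(totals)

# Hypothetical user table with no primary key: "u1" appears twice.
users = [
    {"user_id": "u1", "uploads": 3},
    {"user_id": "u1", "uploads": 2},
    {"user_id": "u2", "uploads": 5},
]

dupes = duplicate_keys(users, "user_id")               # non-empty: table is not keyed by user
per_user = aggregate_by_key(users, "user_id", "uploads")  # one row per user
```

If `dupes` is non-empty, counting rows overstates the number of users, so the analysis should work from `per_user` instead.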

What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely
to buy Office if they also sent mail through Mobile Outlook”, I’m
making many silent assumptions:
Assumptions by field:
User Id:
• Logged and PII-encrypted similarly in Outlook and OneDrive
• Correctly logging timestamp for Office purchase
• User Id isn’t empty or missing
OneDrive activity:
• Wasn’t automated traffic [identified by a certain flag].
Email Activity:
• Mobile client identifiers are correct.
All:
• Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners.

What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties:
Assumption | Why does it matter?
Never null/empty | Causes job-breaking data skew issues
Users are 1:* with Tenants | Logical constraint: sign you are missing something.
Very high cardinality | If this isn’t true, it’s unlikely that it’s a user-id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex | Sanity check: if this isn’t true, it’s unlikely that it’s a user-id.
• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
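
Several of the sanity checks in the table above can be written as data asserts that run on every pipeline execution, not only at setup. The sketch below is one possible shape in plain Python, not the speaker's implementation. The id pattern, the 0.9 distinct-ratio threshold, and the field name `user_id` are all assumptions made for illustration, and the tenant cardinality check is left out because its direction depends on the schema.

```python
import re

# Hypothetical format: the talk does not give the real user-id pattern.
USER_ID_PATTERN = re.compile(r"[0-9a-f]{8}")

def check_user_id_column(users, events, min_distinct_ratio=0.9):
    """Map each assumption name to True (holds) or False (violated).

    users:  list of dicts with a "user_id" field
    events: list of dicts with a "user_id" field
    """
    ids = [u["user_id"] for u in users]
    non_empty = [i for i in ids if i]
    known = set(non_empty)
    return {
        # Never null/empty
        "never_null_or_empty": len(non_empty) == len(ids),
        # Very high cardinality: almost every non-empty value is distinct
        "high_cardinality": bool(non_empty)
        and len(known) / len(non_empty) >= min_distinct_ratio,
        # All rows in event data join to the user table
        "all_events_join": all(e["user_id"] in known for e in events),
        # Matches a certain regex
        "matches_regex": all(
            USER_ID_PATTERN.fullmatch(i) is not None for i in non_empty
        ),
    }

# Hypothetical data: one empty id, and one event with an unknown user.
users = [{"user_id": "0a1b2c3d"}, {"user_id": "4e5f6a7b"}, {"user_id": ""}]
events = [{"user_id": "0a1b2c3d"}, {"user_id": "deadbeef"}]

report = check_user_id_column(users, events)
failed = sorted(name for name, ok in report.items() if not ok)
```

Returning a report instead of raising on the first failure lets the pipeline owner see every violated assumption at once and decide which ones should block the job.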

Data Asserts: Defensive Data Science
Your assumptions and your sanity checks should match!

Data Asserts in Production: A few
Observations
• Most of the analysis-impacting assertion failures we’ve seen were
actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in
order to produce testable chunks that get re-used in pipelines.
• Data Asserts are the backbone of data provenance. A data conclusion
can link directly to all of the assumptions we made about the input.
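
The provenance point can be illustrated by storing assert results next to the conclusion they support. This is a minimal sketch under assumed names: `Conclusion`, `AssertResult`, and the example statement are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class AssertResult:
    name: str      # the assumption that was checked
    passed: bool   # whether it held on this run

@dataclass
class Conclusion:
    statement: str
    asserts: list = field(default_factory=list)

    def record(self, name, passed):
        """Attach the outcome of one data assert to this conclusion."""
        self.asserts.append(AssertResult(name, passed))

    def trusted(self):
        """True only if every recorded assumption held."""
        return all(a.passed for a in self.asserts)

    def failed_assumptions(self):
        return [a.name for a in self.asserts if not a.passed]

# Hypothetical conclusion and assert outcomes.
conclusion = Conclusion("OneDrive uploaders who use Mobile Outlook buy Office more often")
conclusion.record("user id is never empty", True)
conclusion.record("automated traffic is excluded", False)
```

Anyone reading the conclusion can then list the assumptions behind it and see which ones failed on the run that produced it.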


DataEngConf SF16 - Data Asserts: Defensive Data Science

1. Data Asserts Defensive Data Science Tommy Guy Microsoft

2. Observation: Complexity In Pipeline

3. Our pipeline: DATA!!! Insight! Direction! Strategy!

4. Our pipeline in reality: bugs tend to compound DATA!!!

5. How do Engineers Manage Complexity? Encapsulate: create functions/classes/subsystems with clear APIs. This helps isolate complexity. Integration Tests: ensure that the components interact correctly. This helps identify breaking changes.

6. Data introduces a few complications Pipelines take many upstream dependencies Researcher use cases are frequently unknown and unanticipated by data providers. Pushing requirements upstream to all producers is Sisyphean.

7. We are not talking about data pipeline tests
The data pipeline teams:
Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
Are availability SLAs being met?
• Progressive server-client merging

8. Data Scientists Require Semantic Correctness Does this field mean what I think it does?

9. How do Data Scientists identify potential errors?

10. How do Data Scientists identify potential errors? Some follow-on fact is absurd… … which leads to investigation … … which finds a broader problem If [potential conclusion], then we must have 3 billion OneDrive users… … because my user table doesn’t have a primary key … … so I should aggregate by user.
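The failure in slide 10 (a user table with no primary key inflating a user count) can be caught before it reaches a conclusion. The following is a minimal sketch, not the speaker's implementation: the table, the `user_id` field, and both function names are hypothetical.

```python
from collections import Counter


def assert_unique_key(rows, key):
    """Fail loudly if `key` does not uniquely identify each row."""
    counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    if duplicates:
        raise AssertionError(f"{key!r} is not unique; repeated values: {duplicates}")


def count_users(rows, key="user_id"):
    """Count users by aggregating on the key instead of counting rows."""
    return len({row[key] for row in rows})


# Hypothetical user table in which one user appears twice.
users = [
    {"user_id": "u1", "plan": "free"},
    {"user_id": "u2", "plan": "paid"},
    {"user_id": "u1", "plan": "paid"},
]

try:
    assert_unique_key(users, "user_id")
except AssertionError as err:
    print(f"data assert failed: {err}")

print(len(users), "rows,", count_users(users), "distinct users")
```

Running the assert first makes the missing primary key visible directly, rather than through an absurd downstream number.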

11. What are your Assumptions? If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions, listed here by field:
User Id
• Logged and PII-encrypted similarly in Outlook and OneDrive
• Correctly logging timestamp for Office purchase
• User Id isn’t empty or missing
OneDrive activity
• Wasn’t automated traffic [identified by a certain flag]
Email Activity
• Mobile client identifiers are correct
All
• Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners

12. What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties:
Assumption | Why does it matter?
Never null/empty | Causes job-breaking data skew issues
Users are 1:* with Tenants | Logical constraint: sign you are missing something.
Very high cardinality | If this isn’t true, it’s unlikely that it’s a user-id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex | Sanity check: if this isn’t true, it’s unlikely that it’s a user-id.
• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
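Four of the checks in slide 12 can be written as a function that is re-run on every pipeline execution instead of only at setup. This is a rough sketch under stated assumptions: the ID regex, the `min_distinct_ratio` threshold, and the row layout are all hypothetical, and the user-to-tenant constraint is left out because it depends on a schema the slides do not show.

```python
import re

# Hypothetical ID format; the slide only says the ID "matches a certain regex".
ID_PATTERN = re.compile(r"[0-9a-f]{8}")


def check_office_id(user_rows, event_rows, min_distinct_ratio=0.99):
    """Return a list of violated assumptions; an empty list means all passed."""
    failures = []
    ids = [row.get("OfficeId") for row in user_rows]
    present = [i for i in ids if i]

    # Never null/empty.
    if len(present) < len(ids):
        failures.append("null or empty OfficeId")

    # Matches a certain regex.
    if any(not ID_PATTERN.fullmatch(i) for i in present):
        failures.append("OfficeId does not match expected format")

    # Very high cardinality: nearly every row should carry a distinct ID.
    if present and len(set(present)) / len(present) < min_distinct_ratio:
        failures.append("OfficeId cardinality is too low")

    # All rows in event data join to it.
    known = set(present)
    if any(event.get("OfficeId") not in known for event in event_rows):
        failures.append("event rows with no matching OfficeId")

    return failures


# Hypothetical sample data: one empty ID and one event that joins to nothing.
users = [{"OfficeId": "0a1b2c3d"}, {"OfficeId": "4e5f6a7b"}, {"OfficeId": ""}]
events = [{"OfficeId": "0a1b2c3d"}, {"OfficeId": "ffffffff"}]
print(check_office_id(users, events))
```

Returning the full list of failures, rather than stopping at the first one, shows in a single run which assumptions no longer hold.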

13. Data Asserts: Defensive Data Science

14. Data Asserts: Maintain Quality

15. Data Asserts: Clear Trust Boundaries

16. These should match! Data Asserts: Defensive Data Science

17. Data Asserts in Production: A few Observations
• Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.
• Data Asserts are the backbone of data provenance. A data conclusion can directly link to all of the assumptions about the input that we made.
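One way to picture the provenance point in slide 17 is to have each conclusion carry the names of the asserts that passed on its input. This is an illustrative sketch only: `Conclusion`, `conclude`, and the sample asserts are hypothetical names, not part of any tooling described in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Conclusion:
    """A result together with the input assumptions that were verified for it."""
    statement: str
    verified_assumptions: list = field(default_factory=list)


def conclude(statement, data, asserts):
    """Run every named assert on `data`; return a Conclusion only if all pass."""
    failed = [name for name, check in asserts.items() if not check(data)]
    if failed:
        raise AssertionError(f"assumptions failed: {failed}")
    return Conclusion(statement, list(asserts))


# Hypothetical input rows and asserts, for illustration only.
rows = [{"user_id": "u1", "is_bot": False}, {"user_id": "u2", "is_bot": False}]
asserts = {
    "user_id is never empty": lambda data: all(r["user_id"] for r in data),
    "no automated traffic": lambda data: not any(r["is_bot"] for r in data),
}

result = conclude("2 active users", rows, asserts)
print(result.statement, "<-", result.verified_assumptions)
```

Because the conclusion is only constructed when every assert passes, anyone reading the result can see which input assumptions it rests on.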
