Elijah ben Izzy, Data Platform Engineer at Stitch Fix
Data Engineering
At Stitch Fix, data is integral to every facet of our business. We run a plethora of dataflows to transform raw data into features that models use to serve customers. We need to scale these dataflows both in code complexity, as we add new capabilities, and in data size, as we gain more customers. To ensure that these workflows don't devolve into an unmaintainable mess of spaghetti code (chock-full of in-place pandas operations), we built and open-sourced Hamilton, a pluggable microframework that makes scaling and managing complex dataflows easy. To use Hamilton, you create a dataflow by writing simple Python functions in a declarative manner; the framework stitches them together, introducing an abstraction to configure and execute these dataflows. In this talk, we'll present the basic concepts of Hamilton, discuss its impact at Stitch Fix, and share recent extensions to the project, including integrations with Dask, Spark, and Ray, and why we're excited for its future.
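To give a flavor of this declarative style, here is a minimal, runnable sketch modeled on the hello-world example shown later in the deck (the concrete input values and the body of d are illustrative placeholders):

# feature_logic.py
import pandas as pd

def c(a: pd.Series, b: pd.Series) -> pd.Series:
    """Sums a with b."""
    return a + b

def d(c: pd.Series) -> pd.Series:
    """Stand-in transform of c (placeholder logic)."""
    return c * 2

# run.py
from hamilton import driver
import feature_logic

dr = driver.Driver({'a': pd.Series([1, 2]), 'b': pd.Series([3, 4])}, feature_logic)
print(dr.execute(['c', 'd']))  # materializes columns c and d as a DataFrame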
3.
Data Con 2021
Data Science @ Stitch Fix
algorithms-tour.stitchfix.com
- Data Science powers the user experience
- 145+ data scientists and data platform engineers
4. Hamilton is Open Source code
> pip install sf-hamilton
Get started in <15 minutes!
Documentation
https://hamilton-docs.gitbook.io/
Lots of examples:
https://github.com/stitchfix/hamilton/tree/main/examples
5. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
The Agenda
6. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
8. What is Hamilton?
declarative: naturally express as readable code
dataflow: represent transformation of any data
paradigm: a novel approach to writing python
9. What is Hamilton?
declarative: naturally express as readable code
dataflow: represent transformation of any data
paradigm: a novel approach to writing python
10. What is Hamilton?
declarative: naturally express as readable code
dataflow: represent transformation of any data
paradigm: a novel approach to writing python
12. Old Way vs Hamilton Way:
Instead of:
df['c'] = df['a'] + df['b']
df['d'] = transform(df['c'])
You declare:
def c(a: pd.Series, b: pd.Series) -> pd.Series:
    """Sums a with b"""
    return a + b
def d(c: pd.Series) -> pd.Series:
    """Transforms C to ..."""
    new_column = _transform_logic(c)
    return new_column
13. Old Way vs Hamilton Way:
Instead of:
df['c'] = df['a'] + df['b']
df['d'] = transform(df['c'])
You declare:
def c(a: pd.Series, b: pd.Series) -> pd.Series:
    """Sums a with b"""
    return a + b
def d(c: pd.Series) -> pd.Series:
    """Transforms C to ..."""
    new_column = _transform_logic(c)
    return new_column
Inputs == Function Arguments
Outputs == Function Name
14. Full Hello World
Functions:
# feature_logic.py
def c(a: pd.Series, b: pd.Series) -> pd.Series:
    """Sums a with b"""
    return a + b
def d(c: pd.Series) -> pd.Series:
    """Transforms C to ..."""
    new_column = _transform_logic(c)
    return new_column
"Driver" - this actually says what and when to execute:
# run.py
from hamilton import driver
import feature_logic
dr = driver.Driver({'a': ..., 'b': ...}, feature_logic)
df_result = dr.execute(['c', 'd'])
print(df_result)
15. Hamilton TL;DR:
1. For each transform (=), you write a function(s).
2. Functions declare a DAG.
3. Hamilton handles DAG execution.
[DAG: a, b → c → d]
# feature_logic.py
def c(a: pd.Series, b: pd.Series) -> pd.Series:
    """Replaces c = a + b"""
    return a + b
def d(c: pd.Series) -> pd.Series:
    """Replaces d = transform(c)"""
    new_column = _transform_logic(c)
    return new_column
# run.py
from hamilton import driver
import feature_logic
dr = driver.Driver({'a': ..., 'b': ...}, feature_logic)
df_result = dr.execute(['c', 'd'])
print(df_result)
16. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
20. Backstory: TS → DF → 🍝 Code
df = load_dates() # load date ranges
df = load_actuals(df) # load actuals, e.g. spend, signups
df['holidays'] = is_holiday(df['year'], df['week']) # holidays
df['avg_3wk_spend'] = df['spend'].rolling(3).mean() # moving average of spend
df['spend_per_signup'] = df['spend'] / df['signups'] # spend per person signed up
df['spend_shift_3weeks'] = df['spend'].shift(3) # shift spend because ...
df['spend_shift_3weeks_per_signup'] = df['spend_shift_3weeks'] / df['signups']
def my_special_feature(df: pd.DataFrame) -> pd.Series:
    return (df['A'] - df['B'] + df['C']) * weights
df['special_feature'] = my_special_feature(df)
# ...
Now scale this code to 1000+ columns & a growing team 😬
Human scaling 😔:
○ Testing / Unit testing 👎
○ Documentation 👎
○ Code Reviews 👎
○ Onboarding 📈 👎
○ Debugging 📈 👎
Underrated problem!
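Rewritten in the Hamilton style introduced above, those one-liners become standalone, testable functions; a sketch covering two of the columns from the snippet (function names follow the column names on the slide):

# spend_features.py -- sketch: two columns from the snippet above as Hamilton functions
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling 3-week average of spend."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Spend per person signed up."""
    return spend / signups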
21. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
22. Hamilton @ Stitch Fix
● Running in production for 2.5+ years
● Initial use-case manages 4000+ feature definitions
● All feature definitions are:
○ Unit testable
○ Documentation friendly
○ Centrally curated, stored, and versioned in git
● Data science teams ❤ it:
○ Enabled a monthly task to be completed 4x faster
○ Easy to onboard new team members
○ Code reviews are simpler
24. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
25. Hamilton + Feature Engineering: Overview
[Diagram: Load Data → Transform into Features (featurization) → Fit Model(s) (training) → Use Model(s) (inference)]
● Can model this all in Hamilton (if you wanted to)
● We'll just focus on featurization
○ FYI: Hamilton works for any object type.
■ Here we'll assume pandas for simplicity.
○ E.g. use Hamilton within Airflow, Dagster, Prefect, Flyte, Metaflow, Kubeflow, etc.
27. Code that needs to be written:
1. Functions to load data
a. normalize/create common index to join on
2. Transform/feature functions
a. Optional: model functions
3. Drivers materialize data
a. DAG is walked for only what's needed
[Diagram: Data Loaders → Feature Functions → Drivers, grouped into Featurization and Modeling]
28. Code that needs to be written:
1. Functions to load data
a. normalize/create common index to join on
2. Transform/feature functions
a. Optional: model functions
3. Drivers materialize data
a. DAG is walked for only what's needed
[Diagram: Data Loaders → Feature Functions → Drivers, grouped into Featurization and Modeling]
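A rough sketch of what those three pieces can look like together (file paths, module names, and the CSV layout are illustrative assumptions, not from the talk; the feature functions are the ones sketched after slide 20):

# data_loaders.py -- (1) load data, normalized onto a common index
import pandas as pd

def raw_actuals(actuals_path: str) -> pd.DataFrame:
    """Loads actuals (spend, signups, ...) indexed by week."""
    return pd.read_csv(actuals_path, index_col='week')

def spend(raw_actuals: pd.DataFrame) -> pd.Series:
    return raw_actuals['spend']

def signups(raw_actuals: pd.DataFrame) -> pd.Series:
    return raw_actuals['signups']

# run.py -- (3) the driver materializes only the outputs requested
from hamilton import driver
import data_loaders
import spend_features  # (2) feature functions, e.g. avg_3wk_spend, spend_per_signup

dr = driver.Driver({'actuals_path': 'actuals.csv'}, data_loaders, spend_features)
feature_df = dr.execute(['spend_per_signup', 'avg_3wk_spend'])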
30. Q: Why is feature engineering hard?
A: Scaling!
31. Scaling: Served Two Ways
> Human/team:
● Highly coupled code
● Difficulty reusing/understanding work
● Broken/unhealthy production pipelines
← Hamilton helps here!
> Machine:
● Data is too big to fit in memory
● Cannot easily parallelize computation
← Hamilton has integrations here!
32. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
33. How Hamilton Helps with Human/Team Scaling
Highly coupled code → Decouples "functions" from use (driver code).
34. How Hamilton Helps with Human/Team Scaling
Highly coupled code → Decouples "functions" from use (driver code).
Difficulty reusing/understanding work → Functions are curated into modules. Everything is unit testable. Documentation is natural. Forced to align on naming.
35. How Hamilton Helps with Human/Team Scaling
Highly coupled code → Decouples "functions" from use (driver code).
Difficulty reusing/understanding work → Functions are curated into modules. Everything is unit testable. Documentation is natural. Forced to align on naming.
Broken/unhealthy production pipelines → Debugging is straightforward. Easy to version features via git/packaging. Runtime data quality checks.
36. Hamilton functions:
Scaling Humans/Teams
# client_features.py
@tag(owner='Data-Science', pii='False')
@check_output(data_type=np.float64, range=(-5.0, 5.0), allow_nans=False)
def height_zero_mean_unit_variance(height_zero_mean: pd.Series,
                                   height_std_dev: pd.Series) -> pd.Series:
    """Zero mean unit variance value of height"""
    return height_zero_mean / height_std_dev
Hamilton Features:
● Unit testing ✅ always possible
● Documentation ✅ tags, visualization, function doc
● Modularity/reuse ✅ module curation & drivers
● Central feature definition store ✅ naming, curation, versioning
● Data quality ✅ runtime checks
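Because the function above is a plain Python function, a unit test is just a direct call; a minimal pytest-style sketch (the input values are made up):

# test_client_features.py -- illustrative unit test for the function above
import pandas as pd
import pandas.testing as pdt
from client_features import height_zero_mean_unit_variance

def test_height_zero_mean_unit_variance():
    zero_mean = pd.Series([1.0, -1.0, 2.0])
    std_dev = pd.Series([2.0, 2.0, 2.0])
    expected = pd.Series([0.5, -0.5, 1.0])
    pdt.assert_series_equal(height_zero_mean_unit_variance(zero_mean, std_dev), expected)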
37. Code base implications:
Scaling Humans/Teams
1. Functions are always in modules.
2. Driver script, i.e. execution script, is decoupled from functions.
[Diagram: modules spend_features.py, marketing_features.py, customer_features.py are shared by Driver scripts 1, 2, and 3]
> Code reuse from day one!
> Low maintenance to support many driver scripts
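For example, two driver scripts can share the exact same function modules and simply request different outputs (module and output names here are illustrative):

# driver_marketing.py
from hamilton import driver
import data_loaders, spend_features, marketing_features

dr = driver.Driver({'actuals_path': 'actuals.csv'},
                   data_loaders, spend_features, marketing_features)
marketing_df = dr.execute(['spend_per_signup', 'campaign_roi'])  # illustrative outputs

# driver_customer.py
from hamilton import driver
import data_loaders, spend_features, customer_features

dr = driver.Driver({'actuals_path': 'actuals.csv'},
                   data_loaders, spend_features, customer_features)
customer_df = dr.execute(['avg_3wk_spend', 'customer_ltv'])  # illustrative outputs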
38. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
39. Scaling Compute/Data with Hamilton
Hamilton has the following integrations out of the box:
● Ray
○ Single process -> Multiprocessing -> Cluster
● Dask
○ Single process -> Multiprocessing -> Cluster
● Pandas on Spark
○ Enables using the pandas API on Spark with your existing pandas code easily
● Switching to run on Ray/Dask/Pandas on Spark requires:
> Only changing driver.py code*
> Pandas on Spark also needs changing how data is loaded.
40. Scaling Hamilton: Driver Only Change
# run.py
from hamilton import driver
import data_loaders
import date_features
import spend_features
config = {...} # config, e.g. data_location
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted)
save(feature_df, 'prod.features')
41. Hamilton + Ray: Driver Only Change
# run.py
from hamilton import driver
import data_loaders
import date_features
import spend_features
config = {...} # config, e.g. data_location
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted)
save(feature_df, 'prod.features')
# run_on_ray.py
…
from hamilton import base, driver
from hamilton.experimental import h_ray
…
ray.init()
config = {...}
rga = h_ray.RayGraphAdapter(
result_builder=base.PandasDataFrameResult())
dr = driver.Driver(config,
data_loaders, date_features, spend_features,
adapter=rga)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted,
inputs=date_features)
save(feature_df, 'prod.features')
ray.shutdown()
42. Hamilton + Dask: Driver Only change
# run.py
from hamilton import driver
import data_loaders
import date_features
import spend_features
config = {...} # config, e.g. data_location
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted)
save(feature_df, 'prod.features')
# run_on_dask.py
…
from hamilton import base, driver
from hamilton.experimental import h_dask
…
client = Client(Cluster(...)) # dask cluster/client
config = {...}
dga = h_dask.DaskGraphAdapter(client,
result_builder=base.PandasDataFrameResult())
dr = driver.Driver(config,
data_loaders, date_features, spend_features,
adapter=dga)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted,
inputs=date_features)
save(feature_df, 'prod.features')
client.shutdown()
43. Hamilton + Spark: Driver Change + loader
# run.py
from hamilton import driver
import data_loaders
import date_features
import spend_features
config = {...} # config, e.g. data_location
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted)
save(feature_df, 'prod.features')
# run_on_pandas_on_spark.py
…
import pyspark.pandas as ps
from hamilton import base, driver
from hamilton.experimental import h_spark
…
spark = SparkSession.builder.getOrCreate()
ps.set_option(...)
config = {...}
skga = h_spark.SparkKoalasGraphAdapter(spark, spine='COLUMN_NAME',
result_builder=base.PandasDataFrameResult())
dr = driver.Driver(config,
spark_data_loaders, date_features,spend_features,
adapter=skga)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted,
inputs=date_features)
save(feature_df, 'prod.features')
spark.stop()
44. Hamilton + Ray/Dask: How Does it Work?
# FUNCTIONS
def c(a: pd.Series, b: pd.Series) -> pd.Series:
"""Sums a with b"""
return a + b
def d(c: pd.Series) -> pd.Series:
"""Transforms C to ..."""
new_column = _transform_logic(c)
return new_column
# DRIVER
from hamilton import base, driver
from hamilton.experimental import h_ray
…
ray.init()
config = {...}
rga = h_ray.RayGraphAdapter(
result_builder=base.PandasDataFrameResult())
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features,
adapter=rga)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted,
inputs=date_features)
save(feature_df, 'prod.features')
ray.shutdown()
# DAG
45. Hamilton + Ray/Dask: How Does it Work?
# FUNCTIONS
def c(a: pd.Series, b: pd.Series) -> pd.Series:
"""Sums a with b"""
return a + b
def d(c: pd.Series) -> pd.Series:
"""Transforms C to ..."""
new_column = _transform_logic(c)
return new_column
# DRIVER
from hamilton import base, driver
from hamilton.experimental import h_ray
…
ray.init()
config = {...}
rga = h_ray.RayGraphAdapter(
result_builder=base.PandasDataFrameResult())
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features,
adapter=rga)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted,
inputs=date_features)
save(feature_df, 'prod.features')
ray.shutdown()
# Delegate to Ray/Dask
…
ray.remote(
node.callable).remote(**kwargs)
—---—---—---—---—---
dask.delayed(node.callable)(**kwargs)
# DAG
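To make the delegation box above concrete, here is a toy, framework-free sketch of the wrapping trick (this is not Hamilton's actual GraphAdapter interface; it only shows why swapping the executor doesn't touch the function definitions):

# toy_delegation.py -- illustrative only
import dask

def c(a: int, b: int) -> int:
    return a + b

def d(c: int) -> int:
    return c * 2

# topologically ordered nodes: name -> (callable, dependency names)
nodes = {'c': (c, ['a', 'b']), 'd': (d, ['c'])}
results = {'a': 1, 'b': 2}  # seed inputs

for name, (fn, deps) in nodes.items():
    kwargs = {dep: results[dep] for dep in deps}
    # instead of calling fn(**kwargs) directly, wrap it so the framework
    # (here Dask) executes it lazily / on a cluster
    results[name] = dask.delayed(fn)(**kwargs)

print(results['d'].compute())  # (1 + 2) * 2 == 6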
46. Hamilton + Spark: How Does it Work?
# FUNCTIONS
def c(a: pd.Series, b: pd.Series) -> pd.Series:
"""Sums a with b"""
return a + b
def d(c: pd.Series) -> pd.Series:
"""Transforms C to ..."""
new_column = _transform_logic(c)
return new_column
# DRIVER
from hamilton import base, driver
from hamilton.experimental import h_ray
…
ray.init()
config = {...}
rga = h_ray.RayGraphAdapter(
result_builder=base.PandasDataFrameResult())
dr = driver.Driver(config,
data_loaders,
date_features,
spend_features,
adapter=rga)
features_wanted = [...] # choose subset wanted
feature_df = dr.execute(features_wanted,
inputs=date_features)
save(feature_df, 'prod.features')
ray.shutdown()
# With Spark
…
Change these to load the Spark "Pandas" equivalent object instead.
Spark will take care of the rest.
# DAG
47. Hamilton + Ray/Dask/Pandas on Spark: Caveats
Things to think about:
1. Serialization:
a. Hamilton defaults to serialization methodology of these frameworks.
2. Memory:
a. Defaults should work. But fine tuning memory on a “function” basis is not exposed.
3. Python dependencies:
a. You need to manage them.
4. Looking to graduate these APIs from experimental status
>> Looking for contributions here to extend support in Hamilton! <<
48. What is Hamilton?
Why was Hamilton created?
Hamilton @ Stitch Fix
Hamilton for feature engineering
↳ Scaling humans/teams
↳ Scaling compute/data
Plans for the future
49. Hamilton: Plans for the Future
● Async support for an online context
● Improvements to/configurability around runtime data checks
● Integration with external data sources
● Integrations with more open-source compute frameworks
Modin · Ibis · FlyteKit · Metaflow · Dagster · Prefect
● <your idea here>
50. Summary: Hamilton for Feature/Data Engineering
● Hamilton is a declarative paradigm to describe data/feature
transformations.
○ Embeddable anywhere that runs python.
● It grew out of a need to tame a feature code base.
○ It’ll make yours better too!
● The Hamilton paradigm scales humans/teams through software
engineering best practices.
● Hamilton + Ray/Dask/Pandas on Spark enables one to:
scale humans/teams and scale data/compute.
● Hamilton is going exciting places!
51. Give Hamilton a Try!
We’d love your feedback
> pip install sf-hamilton
⭐ on github (https://github.com/stitchfix/hamilton)
☑ create & vote on issues on github
📣 join us on Slack
(https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg)