"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies
Watch videos from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
https://www.youtube.com/c/DataNatives
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2017: http://bit.ly/1WMJAqS
About the Author:
Vladislav is an entrepreneur, machine learning enthusiast, and DevOps geek. Currently, he is co-founding a startup, running a data engineering consulting business, traveling and writing on data-related topics.
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies
1. Data Pipeline Architect
Data Pipelines
For small, messy and tedious data.
Vladislav Supalov, 27th October 2016
2. How to tell if this talk is for you?
● Big Data
○ Pretty fascinating
○ “Good problem to have”
● Most companies
○ Not quite there
○ Should not start at this level
● This is for you, if you are close to the data at a
○ Startup
○ Growing company
○ Established company which is about to start an initiative
● Working with a new CDO, CAO, Head of BI
3. I want to help you achieve better results!
● What will help you to deal with …?
○ small data (not much is needed to be valuable)
○ messy data (multiple data sources, no overview)
○ tedious-to-handle data (multiple data sources, lots of manual work)
● “Use <tech X> in <way Y> and you will be fine”. Nope.
○ Just dealing with data is not a magic bullet
○ This will not guarantee good results for your company
○ You might get lucky of course. That’s not a safe bet.
● How can we improve your chances? Reduce risk.
○ Focus on what matters
4. Jumping straight to tech, we would dive too deep, too early.
● What people tend to think about first:
○ Dashboards
○ Tools
○ Technical solutions, best practices & tricks
● That’s tactics
● We should not jump into implementation details right away.
● Let’s not.
5. The Craft of Designing & Building Data Pipelines
Should start with understanding the business.
6. Hi, I’m Vladislav!
● Data background
○ Machine learning, computer vision, data mining
● Fascination with DevOps
○ Efficient, reliable infrastructure setups
○ Monitoring, automation, processes
● Currently: Co-founding a startup - Pivii Technologies
○ Startup, accelerated by Axel Springer Plug and Play
○ Artificial intelligence for content marketing
○ AI, ML, CV, data!
○ pivii.co
● Previously: Building a data engineering consulting business
○ datapipelinearchitect.com
vsupalov
7. Preferred consulting situation:
● Mobile application marketing agency
○ Not necessarily huge data
○ Very valuable and worthwhile (from a certain point)
● “We built prototype analytics tools in-house and they are mostly functional”
○ “We have seen the value!”
○ But they are painful to work with & broken
○ “Time and money are still being wasted.”
● Tools were created out of an actual need
○ Organically, little planning
○ “How can we do better?”
○ “Where do we go from here?”
8. Common Success Pattern: Business Value was Created.
Already achieved visible and measurable impact for the company.
Or have gotten VERY close to doing so. Are thinking about ROI.
9. Business first. Tech follows.
● Key to successful data projects
○ Especially with limited resources
○ And small data
● Technical decisions should be informed by business needs and goals
● Handling data is a very small part of the whole
○ Straightforward once business needs are clear
● It starts with the mindset
○ Don't consider data plumbing in isolation
10. Key: being conscious and deliberate about the intention of creating business value.
Let’s take a brief detour.
11. Consider sword fighting.
● A great samurai sword master
● 1584 - 1645
● Miyamoto Musashi
○ Martial artist
○ Tactician
○ Strategist
○ Artist
○ Sculptor
○ Calligrapher
○ Writer
○ Philosopher
○ ...
Images: Miyamoto Musashi, self-portrait, http://sv-musashi1.com/about_Musashi.htm,
Musashi Miyamoto with two Bokken, http://www.akinokai.org/images/Images.htm?Musashi.jpg
12. “The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means.”
- Miyamoto Musashi, The Book of Five Rings
13. “Whenever you parry, hit, spring, strike or touch the enemy’s cutting sword, you must cut the enemy in the same movement.”
- Miyamoto Musashi, The Book of Five Rings
14. “It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him.”
- Miyamoto Musashi, The Book of Five Rings
15. “More than anything, you must be thinking of carrying your movement through to cutting him. You must thoroughly research this.”
- Miyamoto Musashi, The Book of Five Rings
16. The goal of sword fighting is to cut the opponent.
● Stating this makes it seem very obvious.
○ Why the effort and emphasis?
● It’s not. Even for aspiring practitioners.
○ Results suffer.
● Mindset is essential for mastery
● The core advice (to my understanding):
○ Attain, cultivate and apply a goal-oriented mindset
○ Aim every step you take towards the goal
17. Back to the world of data-handling businesses!
● When working with company data
○ Before starting out on a project
○ Understand what you want and can achieve
○ Aim to create a positive impact on the business
○ Make it a constant, conscious goal
● The main tasks to do so are:
○ Understand the business
○ Understand the people
■ It’s about communication
○ Understand current processes
○ Be prepared to learn and revise
18. Use this process when approaching a new project:
● Qualify client/project
○ Does it make sense to get involved?
○ Is it evident that we can create value?
● Perform conversations/interviews
○ Find out more about the context
■ company, status, goals, limitations...
○ Learn from first-hand experience
● Summarize information, learnings and plans in writing
○ Roadmap document
○ Depicting the situation and ways forward
19. Is there potential for a good fit?
Do budget, topic and goals seem in order?
20. Qualifying considerations. Learning about the client and project.
● What are you working on?
● What part of the project would you like help with?
● What needs to happen to make this a success for you?
● Why was this project started? What are the business goals?
● Is there an event that triggered it?
● Why especially now?
● What’s the budget? (ballpark estimate)
● When are you looking to get started?
21. Still good? Let’s start a business relationship.
Initial research and planning. Roadmapping consulting package.
22. Four people to talk to:
● Project owner
○ We want this guy to be successful
● Business owner or C-level perspective
○ Knows what’s best for the business
○ "What could the CEO ask you in the hallway?"
● Data wrangler - tales from the trenches
○ Insights into day-to-day business and data details
● Engineering side
○ Current tech stack
○ Information on constraints and preferences
○ Last touches
● Conversation focus, questions and duration vary from person to person.
23. Interviews completed, situation understood and put into writing.
● After a bit of focused communication, we have a great foundation!
○ Project motivation
○ Business goals
○ Who should benefit
○ How to make it happen
● Different perspectives on the project and business.
● Time for tech!
○ Context clear (goals, constraints)
● Best case:
○ Very few choices left to make
24. Here’s what I would have told myself when starting out:
● Learn about the company
○ Easier with fresh eyes
● Understand the business
○ Multiple perspectives
● Keep the goal in mind
○ Helps with learning the right things
○ Cultivate a business mindset (help earn more/lose less)
○ Aim for results
■ I will not stop saying this anytime soon :)
● Have a process laid out
25. Finally: Tactical Advice Which Fits the Remaining Time.
That’s the right proportion :)
26. Don’t roll your own home-baked scripts.
● "Quick and easy" isn't
● Uniqueness is bad, boring is good
○ Learning curve for others
○ Original author leaving
○ Maintenance time, tricky bugs, code duplication
○ Unexpected failure modes
● Extensibility?
● Growth?
● Metadata?
27. You should know about workflow engines.
● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)
[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python“
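To make the bullet points above more concrete, here is a minimal sketch of what a workflow engine does at its core: it resolves dependencies, reuses steps that have already completed, and contains failures so that a retry only repeats the missing work. This is an illustration in plain Python, not the API of Luigi or any other engine; the names `Task` and `build` and the three example steps are hypothetical.

```python
from typing import Callable, List, Sequence, Set


class Task:
    """One pipeline step: a name, the tasks it depends on, and the work to do."""

    def __init__(self, name: str, run: Callable[[], None],
                 requires: Sequence["Task"] = ()):
        self.name = name
        self.run = run
        self.requires = list(requires)


def build(target: Task, done: Set[str], log: List[str]) -> bool:
    """Run `target` after its dependencies, skipping anything already in `done`.

    Returns True on success. A failing task blocks its dependents, but
    completed upstream steps stay in `done` and are reused on a retry.
    """
    if target.name in done:
        log.append(f"skip {target.name}")
        return True
    for dep in target.requires:
        if not build(dep, done, log):
            log.append(f"blocked {target.name}")
            return False
    try:
        target.run()
    except Exception as exc:  # failure handling: record it, do not crash the run
        log.append(f"failed {target.name}: {exc}")
        return False
    done.add(target.name)
    log.append(f"ran {target.name}")
    return True


# Hypothetical three-step pipeline whose middle step fails on the first attempt.
attempts = {"count": 0}


def flaky_clean() -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("broken input")


extract = Task("extract", lambda: None)
clean = Task("clean", flaky_clean, requires=[extract])
report = Task("report", lambda: None, requires=[clean])

done: Set[str] = set()
log: List[str] = []
first = build(report, done, log)   # fails at "clean", "report" is blocked
second = build(report, done, log)  # retry: "extract" is skipped, the rest runs
```

A real engine tracks completion through persistent outputs (files, database rows) rather than an in-memory set, and adds scheduling, logging, alerting and monitoring on top. That is the effort a home-baked script would have to repeat.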
28. If in doubt, try Luigi.
● Spotify
○ Lots of data!
○ 10k+ Hadoop jobs every day [1]
● Battle hardened
○ Open-sourced in 2012
○ Has been used in production by large companies for a while
● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks
[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
29. Usually worthwhile pipeline properties:
● Keep it small and lean
● Make learning and iterating easy
○ Changes should be cheap to accommodate (in both time and money)
● Build something to start learning
● Get data into one place
● Don’t reinvent the wheel
○ The tools are out there
○ ETL and workflow engines
● Create quick positive results, be efficient (lazy)
○ Many small improvements everywhere
○ Instead of solving everything for one group
○ More bang-for-the-buck
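As an illustration of "get data into one place" with small data, the sketch below loads two separate exports into a single SQLite database and answers a question that needs both. It uses only the Python standard library. The CSV contents, the table names and the `load_csv` helper are made up for this example; real sources would be files or API dumps.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two different tools.
ADS_CSV = "date,campaign,spend\n2016-10-01,launch,120.50\n2016-10-02,launch,80.00\n"
INSTALLS_CSV = "date,campaign,installs\n2016-10-01,launch,40\n2016-10-02,launch,25\n"


def load_csv(conn: sqlite3.Connection, table: str, text: str) -> int:
    """Load CSV text into `table`, creating it from the header row.

    Table and column names come from trusted, hard-coded input here;
    values are passed as parameters. Returns the number of rows loaded.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return len(data)


conn = sqlite3.connect(":memory:")
load_csv(conn, "ad_spend", ADS_CSV)
load_csv(conn, "installs", INSTALLS_CSV)

# With both sources in one place, a cross-source question is a single query.
cost_per_install = conn.execute(
    """
    SELECT a.date, CAST(a.spend AS REAL) / CAST(i.installs AS REAL)
    FROM ad_spend AS a
    JOIN installs AS i ON a.date = i.date AND a.campaign = i.campaign
    ORDER BY a.date
    """
).fetchall()
```

A single-file database is often enough at this scale, and keeping the loading step this small makes it cheap to change once the business questions become clearer.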
30. In conclusion:
● Don’t dive into tactics right away
● Aim to create business value
○ Make it a conscious goal
● Understand the business, people and processes
○ This will take some time. It’s a good investment.
○ Have a process yourself
○ Tech choices will follow
● Try to make it easy to learn and iterate
● Get data in one place
● Don’t go with home-baked scripts
● Consider workflow engines
○ Luigi in particular
31. Thanks! Want to learn more?
“What questions to ask? Am I missing something?”
For your future interviews and planning:
I want to share my seed-question lists with you!
Just drop me your email address at:
http://datapipelinearchitect.com/datanatives/