This document discusses using Python for data logistics and ETL processes. It defines data logistics as the management of data in motion, including extract, transform, load, and other processes. It notes that data logistics is a complex problem involving many data flows and transformations. It argues that Python is a good fit for data logistics due to its versatility, readability, extensive libraries, and ability to be used across all stages from ETL to analysis. It provides examples of Python components that could be used for tasks like scheduling, auditing, file transport, loading, publishing, and transformations.
Python for Data Logistics
1. Using Python for Data Logistics
Ken Farmer
Data Science and Business Analytics Meetup
http://www.meetup.com/Data-Science-Business-Analytics/events/120727322/
2013-06-25
2. About Data Logistics
My definition: Management of Data in Motion
Which includes: Extract, Transform, Validation, Change Detection, Loading, Summarizing, Aggregation
(and some other stuff I don't care about*)
In Context: A part of every big data analytical project
Primary objective: Make analysis efficient & effective
* SOA, Enterprise Service Buses (ESB), Enterprise Application Integration (EAI), etc. But since these don't drive big data analytics, we're not going to talk about them.
3. Data Logistics Characteristics
- there will be many flows
Note:
● There may be many sources of any type of data
● There will be many different source constraints – operating systems, networks, etc
● There will be upstream changes that will not be communicated – you will just see them in the data
Typical Large Security Data Warehouse
4. Data Logistics Characteristics
Side Note - this is why there are many flows
1 Feed: a year of data mining will produce almost nothing
- or -
11 Feeds: lots of low-hanging fruit
So, which will produce the best analysis?
5. Data Logistics Characteristics
- and each flow can be complex
Parts not shown:
● File Movement
● Logging & Auditing & Alerting
● Process Monitoring
● Scheduling
Considerations not shown:
● Recovery
● Performance with High Volumes
● Management
6. Data Logistics Characteristics
- and there's no simple alternative
The Great Idea vs. The Sad Reality
No delta processing:
● Explodes data volumes
● Reduces functionality
No lookups:
● Explodes data volumes
● Reduces reporting query performance
No dimensions:
● Explodes data volumes
● Reduces reporting functionality
● Reduces reporting query performance
No validation:
● Increases maintenance costs
● Increases reporting errors
No standardization:
● Increases reporting costs
● Increases reporting errors
● Increases documentation costs
No management features:
● Decreases reliability
● Increases maintenance costs
8. Nightmare #1 – Data Quality
(Chart: ACME – Widget Production by Month; widgets per month, Jan through Dec)
● Credibility
● Value
● Productivity
9. Nightmare #2 - Reliability
● Extended outages
● Frequent outages
● Missed SLAs
● Distractions from new development
13. Root Cause #1 - magical thinking
There are no fairies
likewise there are no silver bullets
and your CRUD experience won't help you
14. Root Cause #2 - non-linear scalability
Gorillas don't scale gracefully
Neither will your feeds
The problem isn't performance –
it's maintenance: dependencies,
cascading errors, and institutional
knowledge.
15. Root Cause #3
– too much consistency or adaptability
These two forces are in conflict – you need a balance.
You have to have consistency to help with learning curves and organization.
You have to have adaptability to get access to all the data sources you'll want.
16. ETL to the Rescue
- data logistics from the corporate world!
● The corporate world started working on this 20 years ago
● It's still a hard problem, but it's less of a nightmare
● Starting to make inroads into Data Science / Big Data projects
17. ETL
- Batch pipelines, not messages or transactions
Data is batched
Feeds are organized like assembly lines or pipelines
Each feed is broken into different programs / steps
18. ETL
- Most tools use diagram-driven-development
Which seems great to almost all management
And seems pretty cool, for a while, to some developers
19. ETL
- Most tools use diagram-driven-development
But then someone always has to overdo it
And we are reminded that tools are seldom solutions
20. ETL
- So all is not wonderful
ETL – the last bastion of Computer-Aided Software Engineering (CASE) tools
Feature                        ETL Tool    Custom Code
Unit test harnesses            no          yes
TDD                            no          yes
Version control flexibility    no          yes
Static code analysis           no          yes
Deployment tool flexibility    no          yes
Language flexibility           no          yes
Continuous Integration         no          yes
Virtual environments           no          yes
Diagrams                       yes         yes
So, why don't we use metadata-driven or code-generation tools for everything?
Why not use tools like Frontpage for all websites?
21. ETL
- So, Buy (& Customize) vs Build
The ETL Tool Paradox:
● Programmers don't want to work on it
● But the tool can only handle 80% of the problem without programming
Where the Buy option is a great fit:
● 100+ simple feeds
● Lack of programmer culture
● Standard already exists
Most typically – the “corporate data warehouse”: a single database for an entire company (usually a bad idea anyway)
22. Python
- a perfect fit for data logistics
● You can use the same language for ETL, systems management and data analysis
● The language is high-level and maintenance-oriented
● It's easy for users to understand the code
● It allows you to use all the programming tools
● It's free
● It's a language for enthusiasts
● And it's fun
- http://xkcd.com/353/
23. Python
- Build List
For each Feed Application
● Program: Extract
● Program: Transform
● Config: File-Image Delta
● Config: Loader
● Config: File Mover
Services, Libraries and Utilities
● Service: metadata, auditing & logging, dashboard
● Service: data movement
● Library: data validation
● Utility: file-image delta
● Utility: publisher
● Utility: loader
24. Python
- Typical Module List
Third-Party
● appdirs
● database drivers
● sqlalchemy
● pyyaml
● validictory
● requests
● envoy
● pytest
● virtualenv
● virtualenvwrapper
Standard Library
● os
● csv
● logging
● unittest
● collections
● argparse
● functools
Environmentals
● Version control – git, svn, etc
● Deployment – Fabric, Chef, etc
● Static analysis – pylint
● Testing – pytest, tox, buildbot, etc
● Documentation - sphinx
Bottom line: a mostly vanilla and very free environment will get you very far
25. Python ETL Components
- Scheduling
● Typically cron
● Daemon if you want more than one run > minute
● Should have suppression capability beyond commenting out the cron job
● Event-driven > temporally-driven
● Need checking for more than one instance running
● Level of effort: very little
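A minimal sketch of the two checks above (suppression and single-instance), assuming a POSIX host and a cron launch; the lock-file and suppression-file paths are illustrative, not from the deck:

import fcntl
import os
import sys

LOCK_PATH = '/var/tmp/my_feed.lock'          # hypothetical lock file for this feed
SUPPRESS_PATH = '/var/tmp/my_feed.suppress'  # touch this file to suppress runs

def run_feed():
    pass    # the actual extract/transform/load steps would go here

def main():
    # suppression capability beyond commenting out the cron job
    if os.path.exists(SUPPRESS_PATH):
        print('feed is suppressed - exiting')
        return 0
    # make sure no more than one instance is running
    lock_file = open(LOCK_PATH, 'w')
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        print('another instance is already running - exiting')
        return 1
    run_feed()
    return 0

if __name__ == '__main__':
    sys.exit(main())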
26. Python ETL Components
- Audit System
● Analyze performance & rule issues over time
● Centralize alerting
● Level of effort: weeks
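As a rough illustration (not the deck's actual schema), a helper that writes one audit row per feed step to a central table, so performance and rule issues can be analyzed over time and alerting can be centralized:

import sqlite3
from datetime import datetime

AUDIT_DB = 'audit.db'   # hypothetical central audit database

def record_step(feed_name, step_name, status, row_count=0, message=''):
    """Write one audit row per feed step."""
    conn = sqlite3.connect(AUDIT_DB)
    conn.execute("""CREATE TABLE IF NOT EXISTS audit_step
                    (feed TEXT, step TEXT, status TEXT,
                     row_count INTEGER, message TEXT, run_ts TEXT)""")
    conn.execute("INSERT INTO audit_step VALUES (?, ?, ?, ?, ?, ?)",
                 (feed_name, step_name, status, row_count, message,
                  datetime.utcnow().isoformat()))
    conn.commit()
    conn.close()

# usage: record_step('firewall_feed', 'load', 'ok', row_count=125000)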
27. Python ETL Components
- File Transporter
File movement is extremely failure-prone:
- out of space errors
- permission errors
- credential expiration errors
- network errors
So, use a process external to feed processing to move files – and
simplify their recovery.
Note this is not the same as data mirroring:
- moves files from source to destination
- renames file during movement
- moves/deletes/renames source after move
- So, you may need to write this yourself – rsync is not ideal
Level of Effort: pretty simple, 1-3 weeks to write reusable utility
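A minimal sketch of the move-and-rename behavior described above, assuming local filesystem paths; a real transporter would also handle space, permission, credential and network errors:

import os
import shutil

def transport_file(source_path, dest_dir, final_name):
    """Copy under a temporary name, then rename, so consumers never see partial files."""
    tmp_path = os.path.join(dest_dir, final_name + '.tmp')
    final_path = os.path.join(dest_dir, final_name)
    shutil.copy2(source_path, tmp_path)      # may fail on space or permission errors
    os.rename(tmp_path, final_path)          # atomic on the same filesystem
    # move the source aside so a failed downstream step can be re-driven later
    archive_dir = os.path.join(os.path.dirname(source_path), 'archive')
    os.makedirs(archive_dir, exist_ok=True)
    shutil.move(source_path, os.path.join(archive_dir, os.path.basename(source_path)))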
28. Python ETL Components
- Load Utility
Functionality
● Validates data
● Continuously loads
● Moves files as necessary
● May run delta operation
● Handles recoveries
● Writes to audit tables
Bottom line: pretty simple, 1-3 weeks to write reusable utility
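A compressed sketch of such a load pass, assuming CSV input files; validate_row, load_rows and record_step are hypothetical stand-ins for the real validation, bulk-load and audit pieces:

import csv
import glob
import os
import shutil

def run_loader(incoming_dir, loaded_dir, rejected_dir):
    """One continuously-runnable pass: validate each waiting file, load it, move it, audit it."""
    for path in sorted(glob.glob(os.path.join(incoming_dir, '*.csv'))):
        good, bad = [], []
        with open(path, newline='') as f:
            for row in csv.DictReader(f):
                (good if validate_row(row) else bad).append(row)
        load_rows(good)                                    # bulk load / insert step
        dest = loaded_dir if not bad else rejected_dir     # simple recovery decision
        shutil.move(path, os.path.join(dest, os.path.basename(path)))
        record_step('my_feed', 'load', 'ok' if not bad else 'partial',
                    row_count=len(good), message='%d rejects' % len(bad))

def validate_row(row):
    return bool(row)        # placeholder validation rule

def load_rows(rows):
    pass                    # placeholder for the database bulk load

def record_step(feed, step, status, row_count=0, message=''):
    pass                    # placeholder audit hook (see the audit sketch above)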
29. Python ETL Components
- Publish Utility
Functionality
● Extracts all data since the last time it ran
● Can handle max rows
● Moves files as necessary
● Handles recoveries
● Writes to audit tables
● Writes all data to a compressed tarball
Bottom line: pretty simple, 1-3 weeks to write reusable utility
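A rough sketch of the publish pattern, assuming a SQLite source with an integer id column used as the high-water mark; the table, state-file and output names are illustrative:

import csv
import os
import sqlite3
import tarfile

STATE_FILE = 'publish_high_water.txt'   # hypothetical: remembers the last published id

def publish(db_path, out_prefix, max_rows=1_000_000):
    """Extract everything since the last run, write a compressed tarball, advance the mark."""
    last_id = int(open(STATE_FILE).read()) if os.path.exists(STATE_FILE) else 0
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, max_rows)).fetchall()
    conn.close()
    if not rows:
        return None
    csv_path = out_prefix + '.csv'
    with open(csv_path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    tar_path = out_prefix + '.tar.gz'
    with tarfile.open(tar_path, 'w:gz') as tar:     # compressed tarball for downstream use
        tar.add(csv_path, arcname=os.path.basename(csv_path))
    os.remove(csv_path)
    with open(STATE_FILE, 'w') as f:                # recovery: a re-run picks up after this id
        f.write(str(rows[-1][0]))
    return tar_path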
30. Python ETL Components
- Delta Utility
Functionality
● Like diff – but for structured files
● Distinguishes between key fields vs non-key fields
● Can be configured to skip comparisons of certain fields
● Can perform minor transformations
● May be built into Load utility, or a transformation library
Bottom line: pretty simple, 1-3 weeks to write reusable utility
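A toy version of such a key-aware delta, assuming two CSV files that share a key column; the field-skip option is included, the minor-transformation hooks are omitted:

import csv

def file_delta(old_path, new_path, key_field, ignore_fields=()):
    """Return (inserts, deletes, changes) between two structured files, keyed by key_field."""
    def load(path):
        with open(path, newline='') as f:
            return {row[key_field]: {k: v for k, v in row.items()
                                     if k not in ignore_fields}
                    for row in csv.DictReader(f)}

    old_rows, new_rows = load(old_path), load(new_path)
    inserts = [new_rows[k] for k in new_rows if k not in old_rows]
    deletes = [old_rows[k] for k in old_rows if k not in new_rows]
    changes = [new_rows[k] for k in new_rows
               if k in old_rows and new_rows[k] != old_rows[k]]
    return inserts, deletes, changes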
31. Python Program
- Simple Transform
def transform_gender(input_gender):
    """ Transforms a gender code to the standard format.
    :param input_gender: in either VARCOPS or SITHOUSE formats
    :returns: standard gender code
    """
    if input_gender.lower() in ['m', 'male', '1', 'transgender_to_male']:
        output_gender = 'male'
    elif input_gender.lower() in ['f', 'female', '2', 'transgender_to_female']:
        output_gender = 'female'
    elif input_gender.lower() in ['transsexual', 'intersex']:
        output_gender = 'transgender'
    else:
        output_gender = 'unknown'
    return output_gender
Observation: Simple transforms and rules can be easily read by non-programmers.
Observation: Transforms can be kept in a module and easily documented.
Observation: Even simple transforms can have a lot of subtleties, and are likely to be referenced or changed by users.
32. Python Program
- Complex Transformation
import whitelist   # the author's internal library that supplies ip_expansion()

def explode_ip_range_list(ip_range_list):
    """ Transforms an ip range list to a list of individual ip addresses.
    :param ip_range_list: comma or space delimited ip ranges or ips.
        Ranges are separated with a dash, or use CIDR notation.
        Individual IP addresses can be represented with a dotted quad,
        integer (unsigned), hex or CIDR notation.
        ex: "10.10/16, 192.168.1.0 - 192.168.1.255, 192.168.2.3,
             192.168.3.5 - 192.168.5.10, 192.168.5, 0.0.0.0/1"
    """
    output_ip_list = []
    for ip in whitelist.ip_expansion(ip_range_list):
        output_ip_list.append(ip)
    return output_ip_list
Ok, this is a cheat – the complexity is in the library
Observation: Complex transforms that would be a nightmare in a tool can be easy in Python – especially, as in this case, when there's a great module to use.
Observation: Unit-testing frameworks are incredibly valuable for complex transforms.
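To illustrate that last observation, a minimal pytest module for the transform_gender function from slide 31; the test cases and the transforms module path are ours, not from the deck:

import pytest
from transforms import transform_gender   # assumes the transform lives in transforms.py

@pytest.mark.parametrize('raw, expected', [
    ('M', 'male'),
    ('female', 'female'),
    ('2', 'female'),
    ('intersex', 'transgender'),
    ('', 'unknown'),
    ('garbage', 'unknown'),
])
def test_transform_gender(raw, expected):
    assert transform_gender(raw) == expected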
34. The Bottom Line
Thank You – Any Questions?
The Good:
● Python for attracting & retaining developers
● Python for handling complexity
● Python for costs
● Python for adaptability
● Python for modern development environment
The Not Good:
● Lack of good practices adds risk
● Lack of a rigid framework requires discipline
The Tangential:
● Hadoop – who said anything about hadoop?
Editor's Notes
About me: 20 years working on data logistics for big data projects for a variety of clients; 10+ years working with Python on data logistics; live in Manitou Springs; currently work for IBM as a data architect responsible for their security data warehouse; have presented on this topic often.
I didn't say “Big Data Project” – a big social networking site with 1 PB of content may not be doing as much analysis and may not require as many feeds. Many would say this is the hardest part of data science. Many would say this can consume 90% of a data science budget.
As I'll get to in the next slide, you will probably have ***many*** feeds. This shows an ideal security data warehouse set of feeds: 24 feeds – but it could really be > 50.
Firewall only: stuck with looking for patterns – might identify scans, might identify recon, will miss all distributed attacks. Firewall+: can tell if a scan came from a whitelist, can see if activity involves known bad guys, can see if activity involves high-value or vulnerable assets.
Acknowledgements to Mike Koenig, and Drum 8. “An Upsetting Theme” by Kevin MacLeod. Licensed under Creative Commons “Attribution 3.0″ http://creativecommons.org/licenses/by/3.0/ and used here by permission, and with appreciation and thanks. Herbert Morrison’s on-the-scene recordings of the Disaster are Public Domain. Thanks to http://www.americanrhetoric.com for access.
Above example – the problem won't disappear for 11 months; users will be reminded of it until it does. This is unlike a transactional system, in which evidence of problems is hidden. Quality problems are one of the top reasons for analytical system failure. Example: a country threatened to go to the UN if my company didn't retract and apologize for its wrong analysis based on my data. Pretty intense.
Source systems won't tell you of changes they've made. Many complex business feeds to maintain.
http://creativecommons.org/licenses/by/2.0/deed.en Example: a system I'm familiar with is spending 4x what we're spending on hardware & support, and loads at 1/8000 of our speed.
http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en http://www.flickr.com/photos/slworking/5328601506/ You could eventually paint yourself into a corner – in which the maintenance of your feeds is nearly impossible to keep up with. Examples: I know of some systems that take 6 months to build feeds; others can do the exact same feed in 1 month.
ETL tools aren't silver bullets; XML isn't a silver bullet; your experience building transactional systems won't help you. This is not your world, it's your father's world – the world of mainframe batch systems from the 60s & 70s: few streams, web services too slow for the big feeds, no fat object layers, no record-by-record transactions; instead, batch processing, bulk loading, and merging of files.
Gorillas don't scale – King Kong couldn't exist because the square-cube law would require his bones to be disproportionately larger in cross-section at that size. Likewise, the work to build and maintain 50 feeds is more than 50x the work to do 1: overhead services become more important and take up more time, feeds have interdependencies, and they don't age terribly well – you discover that upstream systems make changes, say annually, without telling you.
You need consistency to keep maintenance costs low. Too much inconsistency and you'll have an unmaintainable nightmare. But you need adaptability to work around source system requirements. Too much consistency here and you'll be unable to add new data. Ex: - you may have to use a client library in some other language - you may have to use RSS, SSL, RMI, etc - you may have an extract on the other side of a firewall
These two worlds just don't talk much. Especially since most ETL solutions have been closed source – it's a domain that's invisible to open source projects. Plus, ETL just isn't SEXY. Now that Big Data projects are happening in Corporate environments, and open source ETL is getting coverage – it's getting more visibility.
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/ And most solutions involve diagraming your feed, and the solution either: - generates code - runs metadata through an engine
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/
CASE tools were pretty much abandoned by the mid-90s, but not for ETL – since its main adherents were those that didn't program much anyway. So they've lingered, and so has the myth that ETL is too hard to write by hand. In the late 90s the Meta Group released a study showing that COBOL programmers were more productive than the users of any ETL software.
My apologies to the Ruby guys who are all sick of this cartoon by now