Intro talk for UNC School of Information and Library Science. Covers basics of Lucene and Solr as well as info on Lucene/Solr jobs, opportunities, etc.
Exceptions are the Norm: Dealing with Bad Actors in ETL (Databricks)
Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.
In this talk, we go over new and upcoming features in Spark that enable it to better serve such workloads. These features include isolation of corrupt input records and files, useful diagnostic feedback to users, and improved support for the nested types that are common in ETL jobs.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
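For context, the corrupt-record isolation described above surfaces in Spark's DataFrame readers. A minimal PySpark sketch, assuming a hypothetical events.json input (PERMISSIVE, DROPMALFORMED, and FAILFAST are the reader's standard parser modes; badRecordsPath is a Databricks-specific option for writing bad records to files):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bad-records-demo").getOrCreate()

# PERMISSIVE (the default) keeps malformed rows, routing the raw text of
# each bad record into a designated string column instead of failing the job.
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("events.json"))   # hypothetical input path
df.cache()   # recent Spark versions require this before querying the corrupt column

good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.filter(df["_corrupt_record"].isNotNull())

# mode=DROPMALFORMED silently discards bad rows; mode=FAILFAST aborts
# on the first corrupt record.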
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets and Streaming - by Michael Armbrust (Databricks)
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
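As a hedged illustration of the point about higher-level APIs enabling automatic optimization (the dataset and column names here are hypothetical), a declarative DataFrame query gives Catalyst the freedom to prune columns and push filters down before any data is scanned, and explain() shows the resulting plans:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
events = spark.read.parquet("events.parquet")   # hypothetical dataset

# Because the query describes *what* is wanted rather than *how* to compute
# it, Catalyst can reorder and prune this work during optimization.
result = (events
          .filter(F.col("country") == "US")
          .groupBy("device")
          .count())

result.explain(True)   # prints the parsed, analyzed, optimized, and physical plans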
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Enabling exploratory data science with Spark and R (Databricks)
R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode, most of the friction is at the interface between R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported into R as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience, and we present an overview of the SparkR architecture, including how data and control are transferred between R and the JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R, and interactive notebook environments, such as Jupyter or Databricks, facilitate exploratory analysis of large data.
Apache Lucene™ is a free, open-source, high-performance, full-featured text search engine library written entirely in Java. As a technology, it is best suited to any application that requires full-text search, especially cross-platform.
From Pipelines to Refineries: Scaling Big Data Applications (Databricks)
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data.
Apache Spark provides strong building blocks for batch processing, streams, and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that can involve hundreds of transformation steps, especially when confronted with the need for rapid iteration.
This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
Thorough introduction to language parsing in C#, overview of different approaches, and live-coding session that showcases how to build a working JSON parser using Sprache
Supporting material for the hadoop-example GitHub repository. You can run the code on your local computer to help you understand how its MapReduce works.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (DataStax Academy)
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high-performance big data applications today.
Parsl: Pervasive Parallel Programming in Python (Daniel S. Katz)
A seminar presented at the School of Computer Science at the University of St Andrews, 18 October 2019 (see https://blogs.cs.st-andrews.ac.uk/csblog/2019/09/25/daniel-katz-parsl/).
Script of Scripts: Polyglot Notebook and Workflow System (Bo Peng)
Script of Scripts: Polyglot Notebook and Workflow System for both interactive multi-language data analysis and batch data processing. Video at https://youtu.be/U75eKosFbp8
Donald Miner will do a quick introduction to Apache Hadoop, then discuss the different ways Python can be used to get the job done in Hadoop. This includes writing MapReduce jobs in Python in various different ways, interacting with HBase, writing custom behavior in Pig and Hive, interacting with the Hadoop Distributed File System, using Spark, and integration with other corners of the Hadoop ecosystem. The state of Python with Hadoop is far from stable, so we'll spend some honest time talking about the state of these open source projects and what's missing.
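One of the approaches mentioned above, writing MapReduce jobs in Python, is often done with Hadoop Streaming, where the mapper and reducer are ordinary scripts reading stdin and writing stdout. A minimal word-count sketch (file and jar paths are illustrative):

#!/usr/bin/env python
# mapper.py -- emit (word, 1) for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py -- Hadoop delivers keys sorted, so a running total per key suffices
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit('\t', 1)
    if word != current:
        if current is not None:
            print('%s\t%d' % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print('%s\t%d' % (current, total))

# Submit with the streaming jar (path varies by installation):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
#   -input /data/in -output /data/out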
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/16WDJ8b.
Eli Collins overviews how to build new applications with Hadoop and how to integrate Hadoop with existing applications, providing an update on the state of the Hadoop ecosystem, frameworks, and APIs. Filmed at qconnewyork.com.
Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop and member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.
Alternatives to Apache Accumulo's Java API (Josh Elser)
Talk from Accumulo Summit 2015. Covers other projects which allow interaction with Apache Accumulo, defines a model for why you would choose one over another, and evaluates the strengths and weaknesses of each.
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API] (Accumulo Summit)
Talk Abstract
A common tradeoff made by fault-tolerant, distributed systems is the ease of user interaction with the system. Implementing correct distributed operations in the face of failures often takes priority over reducing the level of effort required to use the system. Because of this, applying a problem in a specific domain to the system can require significant planning and effort by the user. Apache Accumulo, and its sorted, Key-Value data model, is subject to this same problem: it is often difficult to use Accumulo to quickly ascertain real-life answers about some concrete problem.
This problem, not unique to Accumulo itself, has spurred the growth of numerous projects to fill these kinds of gaps in usability, in addition to multiple language bindings provided by applications. Outside of the Java API, Accumulo client support varies from programming languages, like Python or Ruby, to standalone projects that provide their own query language, such as Apache Pig and Apache Hive. This talk will cover the state of client support outside of Accumulo’s Java API with an emphasis on the pros, cons, and best practices of each alternative.
Speaker
Josh Elser
Member of Technical Staff, Hortonworks
Josh is a member of the engineering staff at Hortonworks. He is a strong advocate for open source software and is an Apache Accumulo committer and PMC member. He is also a committer and PMC member of Apache Slider (incubating) and regularly contributes to other Apache projects in the Apache Hadoop ecosystem. He holds a Bachelor's degree in Computer Science from Rensselaer Polytechnic Institute.
Introduction to Hadoop.
What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, Mahout?
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud (Databricks)
Efficient recommender systems are critical for the success of many industries, such as job recommendation, news recommendation, ecommerce, etc. This talk will illustrate how to build an efficient document recommender system by leveraging Natural Language Processing (NLP) and Deep Neural Networks (DNNs). The end-to-end flow of the document recommender system is built on AWS at scale, using Analytics Zoo for Spark and BigDL. The system first processes text-rich documents into embeddings by incorporating Global Vectors (GloVe), then trains a K-means model using native Spark APIs to cluster users into several groups. The system further trains a recommender model for each group and gives an ensemble prediction for each test record. By adopting the end-to-end Analytics Zoo pipeline, we saw improvements of about 10% in mean reciprocal rank and 6% in precision, respectively, compared to the search recommendations, for a job recommendation study.
Speaker: Guoqiong Song
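The user-clustering step described above (K-means with native Spark APIs) might look roughly like this in PySpark; the column and path names are hypothetical, and the GloVe/BigDL pieces are omitted:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("user-clustering").getOrCreate()
users = spark.read.parquet("user_embeddings.parquet")   # hypothetical embeddings

# Pack the embedding columns into the single vector column KMeans expects.
assembler = VectorAssembler(inputCols=["f0", "f1", "f2"], outputCol="features")
features = assembler.transform(users)

model = KMeans(k=8, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)   # adds a 'prediction' cluster column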
A talk presented to the US Networking and Information Technology Research and Development (NITRD) Program's High End Computing Interagency Working Group, 16 January 2020
(a slightly updated version of this talk is at https://doi.org/10.6084/m9.figshare.10301741.v1)
A talk on the role of software in research and how NCSA is responding in terms of people and roles - given at the 2019 Data Science Leadership Summit (https://sites.google.com/msdse.org/datascienceleadership2019/).
This is partially based on a previous paper: Daniel S. Katz, Kenton McHenry, Caleb Reinking, Robert Haines, "Research Software Development & Management in Universities: Case Studies from Manchester's RSDS Group, Illinois' NCSA, and Notre Dame's CRC", 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science)
doi: https://doi.org/10.1109/SE4Science.2019.00009
preprint: https://arxiv.org/abs/1903.00732
Requiring Publicly-Funded Software, Algorithms, and Workflows to be Made Publ... (Daniel S. Katz)
A presentation made to OECD's Committee for Scientific and Technological Policy (CSTP) at the Workshop on the Revision of the Recommendation of the Council concerning Access to Research Data from Public Funding, 15 October 2019
How different groups think about software sustainability, what "equations" we might use to measure it, and how it really can't be measured looking forward but only predicted.
Slides for:
"Software Citation in Theory and Practice," by Daniel S. Katz and Neil P. Chue Hong (published paper - https://doi.org/10.1007/978-3-319-96418-8_34; preprint - https://arxiv.org/abs/1807.08149), presented at International Congress on Mathematical Software (ICMS 2018)
Abstract. In most fields, computational models and data analysis have become a significant part of how research is performed, in addition to the more traditional theory and experiment. Mathematics is no exception to this trend. While theory and experiment have a long-developed system of publication and credit (journals and books, often monographs) that has become an expected part of the culture, shaping how research is shared and how candidates for hiring and promotion are evaluated, software (and data) do not have the same history. A group working as part of the FORCE11 community developed a set of principles for software citation that fit software into the journal citation system and allow software to be published and then cited, and there are now over 50,000 DOIs that have been issued for software. However, some challenges remain, including: promoting the idea of software citation to developers and users; collaborating with publishers to ensure that systems collect and retain required metadata; ensuring that the rest of the scholarly infrastructure, particularly indexing sites, includes software; working with communities so that software efforts count; and understanding how best to cite software that has not been published.
A talk about "Conceptualizing a US Research Software Sustainability Institute (URSSI)" presented at the Toward a New Computational Fluid Dynamics Software Infrastructure (CFDSI, https://www.colorado.edu/events/cfdsi/) workshop in Boulder, CO, 16 May 2018.
A brief status of software citation work presented at AAS splinter meeting on implementing the FORCE11 Software Citation Principles in Astronomy (2018-01-11)
A talk about citation and reproducibility in software, presented at the HSF (High Energy Physics Software Foundation) meeting at SDSC, San Diego, CA, USA, 23 January 2017
Based on citation work done by the FORCE11 Software Citation Working Group as well as recent reproducibility discussions, blogs, and papers
Software Citation: Principles, Implementation, and Impact (Daniel S. Katz)
A talk about Software Citation Principles for the 3:am conference (Bucharest, Romania, 28 September 2016), as developed by Arfon M. Smith, Daniel S. Katz, Kyle E. Niemeyer, and the FORCE11 Software Citation Working Group
A talk about the "Working towards Sustainable Software for Science: Practice and Experience (WSSSPE)" community/theme/set of workshops, focused on WSSSPE3, the working groups that were formed there, how they have developed from activities in previous WSSSPE meetings, and their current status.
This talk was given at a Dagstuhl meeting on Engineering Academic Software (http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=16252), 20 June 2016.
Working towards Sustainable Software for Science: Practice and Experience (WS... (Daniel S. Katz)
This was a short talk about the WSSSPE events, given at the Dagstuhl workshop on Engineering Academic Software, 20 June 2016. It mostly discusses the working groups that have formed gradually over the WSSSPE meetings, specifically those that worked through WSSSPE3, and what they have done since then.
Discussing Software Citation and related topics at the Workshop on Data and Software Citation (June 6-7 at Harvard Medical School, http://www.software4data.com/#!nsf-workshop/jghgb)
Scientific Software Challenges and Community Responses (Daniel S. Katz)
A talk given at RTI International on 7 December 2015, discussing 12 scientific software challenges and how the scientific software community is responding to them.
Looking at Software Sustainability and Productivity Challenges from NSF (Daniel S. Katz)
A lightning talk by Daniel S. Katz and Rajiv Ramnath (NSF) at CSESSP workshop - https://www.nitrd.gov/csessp/
Based on a white paper, at http://arxiv.org/abs/1508.03348
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
These are slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
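The abstract doesn't spell out DIAR's analysis, but as a purely hypothetical sketch of the underlying idea (keep only seed bytes whose mutation changes observed behavior), one might write:

# Hypothetical illustration, NOT DIAR's published algorithm. Assumes a
# coverage(seed_bytes) oracle that runs the target and returns a hashable
# coverage signature.
def trim_seed(seed, coverage):
    baseline = coverage(seed)
    keep = bytearray()
    for i, b in enumerate(seed):
        mutated = bytearray(seed)
        mutated[i] ^= 0xFF                       # flip every bit of byte i
        if coverage(bytes(mutated)) != baseline:
            keep.append(b)                       # byte influences behavior; keep it
    return bytes(keep)

# Real tools must be more careful: deleting bytes shifts offsets, which can
# itself change parsing, so removals are usually re-validated against coverage.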
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover Test Automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share foundational concepts to build on.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days, 6 June 2024.
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Enhancing adoption of Open Source Libraries: A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster and ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
How to Get CNIC Information System with Paksim Ga (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How it can help today’s business, and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Expressing and sharing workflows
1. National Center for Supercomputing Applications
University of Illinois at Urbana–Champaign
Expressing and sharing workflows
Daniel S. Katz
Assistant Director for Scientific Software & Applications, NCSA
Research Associate Professor, CS
Research Associate Professor, ECE
Research Associate Professor, iSchool
dskatz@illinois.edu, d.katz@ieee.org
@danielskatz
2. What’s a workflow?
• A set of tasks and dependencies between them
• Perhaps expressed as data structure, e.g. graph (DAG or cyclic)
• How is this different than a computer program?
• The tasks are more well-defined (inputs, outputs)
• The tasks are longer (running time O(sec) – O(hr))
• Why express it differently?
• Program (script) is a natural way of expressing a workflow
• Examples: shell scripts, programs in Swift/Parsl
• YesWorkflow annotations to help in understanding scripts
• Swift/Parsl: functions used to identify components
• Expressing it as data corresponds to the compiled (assembly)
version of the workflow
• Useful for a lot of things, but not understanding
http://swift-lang.org, http://parsl-project.org
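As a minimal sketch of the "workflow as a data structure" view (the task names are illustrative), a DAG of tasks and dependencies can be stored as plain data and executed in topological order:

from graphlib import TopologicalSorter   # Python 3.9+

# Map each task to the set of tasks it depends on.
deps = {
    "fetch": set(),
    "clean": {"fetch"},
    "analyze": {"clean"},
    "plot": {"analyze"},
    "report": {"analyze"},
}

def run(task):
    print("running", task)   # stand-in for a real O(sec)-O(hr) task

for task in TopologicalSorter(deps).static_order():
    run(task)   # every dependency has already completed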
3. YesWorkflow (YW)
• Name: “Yes, scripts are (can be) workflows, too!”
• But, workflow (dataflow) usually hidden in the script
• Idea: let the script author reveal the structure by
declaring tasks (steps) and dataflow between tasks.
• This is a modeling step
• very coarse (workflow: one big black box w/ inputs & outputs)
• or rather fine (workflow has many steps, linked by dataflow)
• => a language to explain (graphically) the concepts
(relevant steps, relevant data) you want to share
• => this conceptual YW model can itself be queried; linked
with runtime observables, provenance
Credit: Bertram Ludäscher
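To make this concrete, YW annotations live in ordinary comments. A small sketch, roughly following YesWorkflow's @begin/@in/@out/@end comment conventions (the script contents and file names are illustrative):

# @begin main
# @in raw_data
# @out report

# @begin clean_data
# @in raw_data
# @out rows
rows = [r for r in open("raw.csv") if r.strip()]   # hypothetical input file
# @end clean_data

# @begin summarize
# @in rows
# @out report
with open("report.txt", "w") as f:
    f.write("%d rows\n" % len(rows))
# @end summarize

# @end main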
4. Parsl
• A python-based parallel scripting library (http://parsl-project.org),
based on ideas in Swift (http://swift-lang.org)
• Tasks exposed as functions (python or bash)
@App('bash', data_flow_kernel)
def echo(message, outputs=[]):
    return 'echo {0} &> {outputs[0]}'

@App('python', data_flow_kernel)
def cat(inputs=[]):
    with open(inputs[0]) as f:
        return f.readlines()
• Return values are futures
• Other tasks can be called that depend on these futures
• Will not run until futures are satisfied/filled
• Main code used to glue functions together
hello = echo("Hello World!", outputs=['hello1.txt'])
message = cat(inputs=[hello.outputs[0]])
• Fairly easy to understand
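The slide uses an early Parsl API; for reference, a hedged sketch of the same example against the current Parsl decorators (based on the project's documentation; details may drift between releases):

import parsl
from parsl import bash_app, python_app
from parsl.configs.local_threads import config
from parsl.data_provider.files import File

parsl.load(config)   # local thread-pool execution

@bash_app
def echo(message, outputs=()):
    # the returned string is the shell command Parsl runs
    return 'echo {} &> {}'.format(message, outputs[0])

@python_app
def cat(inputs=()):
    with open(inputs[0], 'r') as f:
        return f.readlines()

hello = echo("Hello World!", outputs=[File('hello1.txt')])
message = cat(inputs=[hello.outputs[0]])   # blocks on the file future
print(message.result())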
5. How to promote/share workflows
• How do we share general software?
• Libraries (units of execution with well-defined APIs)
• Source code (fork model)
• Source code repositories (GitHub), packaging
systems/repositories (PyPI, CRAN)
• How do we share data?
• Repositories (Dryad)
• For workflows
• Libraries -> sub-workflows, defined to provide well-specified
functionality
• Source code -> source code (scripts), may still be hard to
understand
• Data -> data repository for workflows (MyExperiment)
6. www.myexperiment.org
De Roure, D., Goble, C. Stevens, R. (2009) The Design
and Realisation of the myExperiment Virtual Research
Environment for Social Sharing of Workflows. Future
Generation Computer Systems 25, pp. 561-7.
• A workflow commons for workflow sharing,
designed using Web 2.0 principles
• Launched open beta in November 2007, still
actively used
• Largest public collection of workflows, for
multiple workflow systems
• 2400+ entries in Google Scholar refer to
myExperiment
• Open source, REST API, part of Open Linked
Data cloud (66k triples) - lod-cloud.net
• Introduced “packs” which led to Research
Objects – www.researchobject.org
• Workflow collection studied in scientific
workflow and e-Science communities
• Service maintained by Manchester and Oxford
universities. Informs design of other workflow
sharing systems.
• Content stats: 10591 members, 393 groups,
3876 workflows, 1233 files, 477 packs
Credit: Carole Goble
7. GitHub
• Widely used for sharing software, and socially working
on/with software (and many other types of documents)
• GitHub is used for sharing workflows today
• Both scripts and data
• Borrowing from “Software vs. data in the context of
citation”
• A workflow as a program or a script is code, a creative work
• Appropriate license: OSI-approved open source (e.g., BSD)
• A workflow as a DAG is data?
• Appropriate license: Creative Commons (e.g., CC-BY)?
• So, let’s keep workflows as programs/scripts
• Use YesWorkflow with scripts
• Use GitHub to share
Katz DS, Niemeyer KE, et al. (2016) Software vs. data in the context of citation.
PeerJ Preprints 4:e2630v1 doi: 10.7287/peerj.preprints.2630v1