The document summarizes a presentation on projects for working with Resource Description Framework (RDF) data in the Hadoop ecosystem. It describes Apache Jena Elephas, a set of modules that enables RDF processing on Hadoop by providing Writable types for the RDF primitives along with input/output support. It also discusses Intel Graph Builder, which allows graphs to be created or transformed from arbitrary data sources using Apache Pig. The document encourages readers to try out these projects and contribute by suggesting features, reporting issues, or contributing code.
1. Apache Jena Elephas and Friends: RDF and the Hadoop Ecosystem
Rob Vesse
Twitter: @RobVesse
Email: rvesse@gmail.com
2. About Me
● Software Engineer at Cray Inc
● Working on:
● RDF and SPARQL
● Big Data Analytics
● Active open source contributor
● Apache Jena
● dotNetRDF
● Minor contributions to other Apache projects
● Assorted other bits and pieces on my GitHub and BitBucket
● Primarily interested in intersection of RDF/SPARQL world with rest of Big Data world
3. Talk Overview
● What's missing in the Hadoop ecosystem?
● What's already available?
● Apache Jena Elephas
● Intel Graph Builder
● Other interesting projects
● Getting Involved
● Questions
4. What's missing in the Hadoop ecosystem?
5. Apache, the projects and their logos shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
6. Where's RDF?
● No first class projects
● Some very limited support in other projects
● Giraph can support RDF by bridging through the TinkerPop 2 stack
● Few existing projects
● Mostly academic proofs of concept (POC)
● Some open source efforts but often task specific
● e.g. Infovore, targeted at creating curated Freebase and DBPedia datasets
7. What's needed for RDF?
● Minimum Viable Product
● Standard Writable implementations for primitives
● Input and Output support
● Would be nice to have:
● Tools for translating data to and from RDF
● Integration with the common analytic frameworks
● e.g. Spark, Giraph, Hive, Pig
8. What's already available?
9. Apache Jena Elephas - Background
● Started as a POC at Cray
● Donated to the Apache Jena project on 1st April 2014
● JENA-666
● Originally known as Hadoop RDF Tools
● Renamed to Elephas in December 2014
● Name was suggested by Claude Warren
10. Apache Jena Elephas - What is it?
● Set of modules that are part of the Apache Jena project
● Currently only developer SNAPSHOT builds available
● Will be included as part of the upcoming Jena 2.13.0 release
● Aims to fulfill all the basic requirements for enabling RDF on Hadoop
● Built against Hadoop 2.x APIs
11. Apache Jena Elephas - How do I use it?
● Read the documentation
● http://jena.apache.org/documentation/hadoop/
● Add appropriate Maven dependencies to your code
● http://jena.apache.org/documentation/hadoop/artifacts.html
● Will also need to declare relevant Hadoop dependencies as "provided"
● Use the APIs as-is for basic tasks or use as a starting point for more complex applications
12. Apache Jena Elephas - Common API
● Provides Writable types for the RDF primitives (a usage sketch follows below):
● NodeWritable
● TripleWritable
● QuadWritable
● NodeTupleWritable
● An arbitrarily sized tuple of RDF terms
● Backed by RDF Thrift
● A compact binary serialization for RDF using Apache Thrift
● See http://afs.github.io/rdf-thrift/
● Extremely efficient to serialize and de-serialize
● Allows for efficient WritableComparator implementations that perform comparisons directly on the binary forms
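A minimal sketch of wrapping Jena primitives in these Writable types, assuming the Jena 2.x com.hp.hpl.jena packages this deck targets and the single-argument Writable constructors; check the Elephas javadocs for the exact signatures:

import com.hp.hpl.jena.graph.NodeFactory;
import com.hp.hpl.jena.graph.Triple;
import org.apache.jena.hadoop.rdf.types.NodeWritable;
import org.apache.jena.hadoop.rdf.types.TripleWritable;

// Build an RDF triple with the ordinary Jena API
Triple t = Triple.create(
    NodeFactory.createURI("http://example.org/alice"),
    NodeFactory.createURI("http://xmlns.com/foaf/0.1/name"),
    NodeFactory.createLiteral("Alice"));

// Wrap the primitives for use as Hadoop keys/values;
// serialization happens via RDF Thrift behind the scenes
TripleWritable tripleValue = new TripleWritable(t);
NodeWritable subjectKey = new NodeWritable(t.getSubject());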
13. Apache Jena Elephas - IO API
● Provides Hadoop InputFormat and OutputFormat implementations for RDF
● Covers all RDF serializations Jena supports
● Easily extended with custom formats
● Splits and parallelizes processing of input where the RDF serialization allows it
● Blank Nodes can be awkward
● Transparently handles compressed IO (a conversion job sketch follows below)
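As an illustration, a hedged sketch of a map-only format conversion job (Turtle in, NTriples out). TurtleInputFormat and NTriplesOutputFormat are assumptions based on the naming pattern Elephas uses for its formats, so verify the exact class names against the IO API documentation:

// Identity mapper, no reducer: records flow straight from input to output
Job job = Job.getInstance(new Configuration());
job.setJarByClass(Example.class);
job.setJobName("Turtle to NTriples Conversion");
job.setMapperClass(Mapper.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(TripleWritable.class);
// Elephas handles parsing, splitting (where the serialization allows)
// and compressed inputs transparently
job.setInputFormatClass(TurtleInputFormat.class);
job.setOutputFormatClass(NTriplesOutputFormat.class);
FileInputFormat.setInputPath(job, new Path("/inputs/turtle"));
FileOutputFormat.setOutputPath(job, new Path("/outputs/ntriples"));
job.waitForCompletion(true);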
14. Apache Jena Elephas - Blank Nodes
● Blank Nodes can be problematic
● Need to consistently assign IDs in parallel
● However you will typically produce multiple intermediate output files in multi-job workflows
● Thus need to allow for document versus globally scoped IDs
● Configuration setting controls this (a hedged sketch follows below)
● See documentation for more information
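A hedged sketch of toggling that setting; the property name used here is an assumption based on the RdfIOConstants naming in the Elephas source, so confirm it against the documentation before relying on it:

Configuration config = new Configuration();
// Assumed property name - check RdfIOConstants / the Elephas docs.
// true  = blank node IDs are allocated consistently across all inputs
//         (needed when a later job reads an earlier job's intermediate outputs)
// false = IDs are scoped to the individual input document (the default)
config.setBoolean("rdf.io.input.bnodes.global-identity", true);
Job job = Job.getInstance(config);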
15. Apache Jena Elephas - Map/Reduce API
● Various reusable basic Mapper and Reducer implementations
● Covers common tasks:
● Counting
● Filtering
● Grouping
● Splitting
● Transformation
● Mostly intended for use as a starting point (a custom mapper sketch follows below)
● Some of these are bundled into an RDF stats demo application
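For example, a hedged sketch of a custom mapper built on the Elephas types: it keeps only triples with a given predicate, the kind of task the bundled filter implementations also cover. The value.get() accessor returning the underlying Jena Triple is an assumption to verify against the javadocs:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.jena.hadoop.rdf.types.TripleWritable;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.NodeFactory;

public class FoafNameFilterMapper
    extends Mapper<LongWritable, TripleWritable, LongWritable, TripleWritable> {

  private static final Node FOAF_NAME =
      NodeFactory.createURI("http://xmlns.com/foaf/0.1/name");

  @Override
  protected void map(LongWritable key, TripleWritable value, Context context)
      throws IOException, InterruptedException {
    // Only emit triples whose predicate is foaf:name
    if (FOAF_NAME.equals(value.get().getPredicate())) {
      context.write(key, value);
    }
  }
}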
16. Apache Jena Elephas - Example Job
● Node Count (aka word count for RDF)
● All the classes referenced (bar Example.class) are provided by Elephas
Configuration config = new Configuration();
Job job = Job.getInstance(config);
job.setJarByClass(Example.class);
job.setJobName("RDF Triples Node Usage Count");
// Map/Reduce classes
job.setMapperClass(TripleNodeCountMapper.class);
job.setMapOutputKeyClass(NodeWritable.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(NodeCountReducer.class);
// Input and Output
job.setInputFormatClass(NTriplesInputFormat.class);
job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
FileInputFormat.setInputPath(job, new Path("/inputs/rdf"));
FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
// Submit the job and wait for it to complete
job.waitForCompletion(true);
17. Apache Jena Elephas - Node Count Demo
See end of slide deck for steps to run the demo and screenshots
18. Apache Jena Elephas - Performance Notes
● For NTriples inputs we compared performance of a Text based node count versus RDF based node count
● Performance typically as good (within 10%) and sometimes significantly better
● Heavily dataset dependent
● Varies considerably with cluster setup
● Also depends on how the input is processed
● Be aware YMMV!
19. Intel Graph Builder - What is it?
● Tools for transforming/creating large graphs
● Developed by Intel
● Cray has some proposed improvements that are awaiting merging at time of writing
● Open source under Apache License
● https://github.com/01org/graphbuilder/tree/2.0.alpha
● 2.0.alpha is the preferred branch
● See https://github.com/cray/graphbuilder for the version discussed here
● Allows graphs to be created/transformed from arbitrary data sources using Apache Pig
20. Intel Graph Builder - How do I use it?
● REGISTER the Graph Builder JAR in your Pig script
● May optionally want to IMPORT the pig/graphbuilder.pig script which aliases some of the provided UDFs
● LOAD your data
● Use the provided UDFs to generate a graph
● Can create both property graphs and RDF
● Currently data must be mapped to a property graph and then into RDF
● STORE the resulting graph
21. Intel Graph Builder - How it works?
● Uses a declarative mapping based on Pig primitives
● Has to be explicitly joined to the data
● Limitation of Pig UDFs
● RDF mappings operate on property graphs
● Must map data to a property graph first
● Direct mapping to RDF is a possible future enhancement
22. Intel Graph Builder - Pig Script Example
https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
-- Rest of script omitted for brevity
-- Declare our mappings
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*,
[ 'idBase' # 'http://example.org/instances/',
'base' # 'http://example.org/ontology/',
'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
'propertyMap' # [ 'type' # 'a',
'name' # 'foaf:name',
'age' # 'foaf:age' ],
'uriProperties' # ( 'type' ),
'idProperty' # 'id' ]);
-- Convert to NTriples
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
-- Write out NTriples
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
23. Intel Graph Builder - RDF Generation Demo
See end of slide deck for steps to run the demo and screenshots
24. Other Projects - Infovore
● Framework developed by Paul Houle
● Open source on GitHub
● https://github.com/paulhoule/infovore/wiki
● Apache License 2.0
● Produces a cleaned and curated Freebase dataset using Hadoop for the processing
● Designed to be easily self-deployed on Amazon EC2
● Also some related projects for working with Wikipedia
● https://github.com/paulhoule/telepath
● Currently unclear what direction these projects will take after the Freebase shutdown at end of March this year
25. Other Projects - CumulusRDF
● Academic project from the Institute of Applied Informatics and Formal Description Methods
● https://code.google.com/p/cumulusrdf/
● RDF store backed by Apache Cassandra
● Reasonable performance compared to native RDF stores
● See NoSQL Databases for RDF: An Empirical Evaluation
● Philippe Cudré-Mauroux et al
● http://exascale.info/sites/default/files/nosqlrdf.pdf
● Reasonably active development
26. Getting Involved
27. How to contribute
● Please download and try out these projects
● Interact with the communities and developers involved
● What works?
● What is broken?
● What is missing?
● How could the documentation be better?
● Contribute
● Open source ultimately lives or dies with community participation
● If there's a missing feature then suggest it
● Or better still contribute it yourself!
28. Questions?
Personal Email: rvesse@gmail.com
Apache Jena User List: users@jena.apache.org
These slides will be posted to my SlideShare: http://www.slideshare.net/RobVesse
29. Apache Jena Elephas - Node Count Demo
30. Environment Pre-requisites
● Hadoop 2.x cluster
● Assumes hadoop command is on your PATH
● Download the latest JAR file
● Or build it yourself from source
● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar
● Upload some RDF data to an HDFS folder
31. Run the Demo
● --node-count requests the Node Count statistics be calculated
● Assumes mixed quads and triples input if no --input-type specified
● Using this for triples only data can skew statistics
● e.g. can result in high node counts for default graph node
● Hence we explicitly specify input as triples
> hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --input-type triples /user/input
32.-36. Screenshots of the Node Count demo running
37. Intel Graph Builder - RDF Generation Demo
38. Environment Pre-requisites
● Pig 0.12
● Should work with higher versions but not tested
● Assumes pig command is on your PATH
● Clone the Cray version of the Graph Builder code
● https://github.com/cray/graphbuilder
39. Run the Demo
● Running Pig in local mode for simplicity
● Output goes to /tmp/rdf_triples/
> pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000