This presentation delivers a tour of the graph analytics and open source projects from the graph team at PokitDok. This talk is from the inaugural GraphDay in Austin, TX on Jan 17th, 2016.
The PokitDok data science team uses many components in the TinkerPop stack, along with the Titan graph database. Let it be known, though, that we’re a serious Python shop. As a team, we wanted to do data analytics and not have to context switch between all the languages that are required to stand up this graph database. There was a desire to continue to use Python syntax when defining graph schema using the management system, performing graph traversals, building recommendation systems and so on, but the TinkerPop and Titan stacks run on the JVM.
Our solution: connect the development environments with Jython to build out our own Python library for graph traversals. We’ve open sourced the work we've been doing to help engineers and data scientists use Python to work within TinkerPop and Titan from a Python state of mind.
In this talk, PokitDok’s Engineer #1 teams up with a Data Scientist to discuss the intricacies of our development environment, introduce our open sourced Gremlin-Python library, and explore a graph-based recommendation system. We will step through the underpinnings of Gremlin-Python to create a system that ranks and recommends healthcare professionals.
Graph Day Texas: Open Source Graph Projects from PokitDok
1. A tour of the PokitDok Health Graph and
some open source graph projects
Graph Day Texas, Jan 2016
Denise Gosnell, PhD
Twitter and Github:
@pokitdok
@denisekgosnell
2. PokitDok APIs:
The business of health,
for developers.
https://platform.pokitdok.com/
6. Talk Outline:
What we built.
The HealthGraph
What we’ve open sourced.
A Gremlin-Python Library
Custom Titan Build
Dynamic JSON Graph [WIP]
HealthGraph DSL [WIP]
10. Health Graph: Transactions as Trees
• We treat transactions as first-class objects in the graph
• Buried in the depths of an X12 transaction are the entities of interest
Interactive graph available at:
https://fullmetalhealth.com/dsl/
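The "transactions as trees" idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function name json_to_tree, the edge label scheme, and the sample claim fields are hypothetical, and the talk does not show how PokitDok's loader actually maps X12 transactions into Titan. The sketch treats the transaction as the root vertex and turns every nested object into a child vertex, so the entities of interest end up as nodes hanging off the transaction.

```python
from itertools import count


def json_to_tree(doc, label="transaction"):
    """Flatten a nested dict into (vertices, edges) that form a tree.

    Hypothetical helper for illustration; not PokitDok's loader.
    """
    ids = count()
    vertices, edges = [], []

    def visit(node, node_label, parent_id):
        vid = next(ids)
        props = {}
        vertices.append({"id": vid, "label": node_label, "properties": props})
        if parent_id is not None:
            # Edge label is an assumed naming scheme: has_<child label>.
            edges.append((parent_id, vid, "has_" + node_label))
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, dict):
                    visit(value, key, vid)        # nested entity -> child vertex
                elif isinstance(value, list):
                    for item in value:
                        visit(item, key, vid)     # one child vertex per list item
                else:
                    props[key] = value            # scalar -> vertex property
        else:
            props["value"] = node                 # scalar list item

    visit(doc, label, None)
    return vertices, edges


# Made-up, heavily simplified claim; real X12 transactions are far deeper.
claim = {
    "control_number": "0001",
    "payer": {"name": "Payer A"},
    "provider": {"npi": "1234567890", "specialty": "family practice"},
    "services": [{"cpt_code": "99213"}, {"cpt_code": "85025"}],
}
vertices, edges = json_to_tree(claim)
```

Keeping the transaction itself as the root vertex is what makes it a first-class object: the payer, provider, and service vertices can all be reached by walking out from it.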
14. HealthGraph: Predictive Models
• What is the probability that claim X will be denied?
• A new customer just searched for “family practice”; recommend the best provider within 10 miles.
• Given a CPT code, what is the expected reimbursement rate from insurance company A in zip code 37601?
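The second question (recommend the best provider within 10 miles) can be sketched as a filter on distance followed by a ranking. This is a minimal sketch, not the HealthGraph implementation: the provider records, the "score" field, and the recommend function are hypothetical, and the talk does not say how providers are actually scored. Distance uses the standard haversine formula.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def recommend(providers, specialty, lat, lon, max_miles=10.0):
    """Return providers of a specialty within max_miles, best score first."""
    nearby = [
        p for p in providers
        if p["specialty"] == specialty
        and haversine_miles(lat, lon, p["lat"], p["lon"]) <= max_miles
    ]
    return sorted(nearby, key=lambda p: p["score"], reverse=True)


# Made-up providers; "score" stands in for whatever ranking signal is used.
providers = [
    {"name": "A", "specialty": "family practice", "lat": 30.30, "lon": -97.70, "score": 0.7},
    {"name": "B", "specialty": "family practice", "lat": 30.25, "lon": -97.75, "score": 0.9},
    {"name": "C", "specialty": "family practice", "lat": 31.00, "lon": -97.74, "score": 1.0},
    {"name": "D", "specialty": "cardiology", "lat": 30.27, "lon": -97.74, "score": 0.8},
]
ranked = recommend(providers, "family practice", 30.2672, -97.7431)
names = [p["name"] for p in ranked]
```

Provider C has the highest score but is roughly 50 miles away, so it is filtered out before ranking; D is dropped for its specialty.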
17. Our HealthGraph Production Stack
• Titan 0.5.3
• TinkerPop’s Blueprints 2.5.0
• Cassandra and Elasticsearch
Gremlin-Python
18. Gremlin-Python Motivation
• Lighter context switching between development tools and environments
• Incompatible syntax issues between Gremlin and Python
• Using Python.
Twitter and Github:
@corbinbs
@denisekgosnell
19. Gremlin-Python Test Drive
Option 1: Grab our docker container
1. Install Docker
https://www.docker.com/docker-toolbox
2. Jump in the “Docker Quickstart Terminal”
3. Fire up our example container:
docker run -i -t pokitdok/gremlin-python-test-drive
Option 2: Shell script install
1. Clone our repo:
https://github.com/pokitdok/gremlin-python
2. Run the set-up scripts:
$ ./test_drive/setup.sh && ./test_drive/run.sh
27. Motivation for Release of Custom Build:
Graph Production Stack:
Titan 0.5.x ships with Hadoop 2.2
API Production Stack:
contains Cloudera’s CDH5 containers and Hadoop 2.6.0
You guessed it:
infrastructure dependency errors upon integration
the Hadoop 2.6.0 API is not fully backwards compatible with Hadoop 2.2
28. Released:
A modification of the Titan 0.5.3 build to upgrade to Hadoop 2.6.0 and resolve numerous conflicts among transitive dependencies.
… someone had to do it.
Grab it here:
https://github.com/pokitdok/titan/tree/0.5.3-hadoop2.6.0
Tested for Cassandra but not HBase.
31. Dynamic JSONLoader Future Work
1. Extract PokitDok HealthGraph specific features
2. Move to Titan 1.0 and TP3 compatibility
3. Release on PokitDok GitHub
36. DSL Future Work
1. Move to Titan 1.0 and TP3 compatibility
2. Release on PokitDok GitHub
3. Current Open Question:
We are looking for(ward to) more documentation on implementing custom Gremlin steps (DSLs) in TP3
39. A tour of the PokitDok Health Graph and
some open source graph projects
Graph Day Texas, Jan 2016
Denise Gosnell, PhD
Editor's Notes
Personal story of how I got into graph analytics; graph lineage
we made all of our stuff available via API.
For something the crowd can go see ---
Relevant Timing: Xerox is powered by Pokitdok
we are tackling two wild fields.
navigating the wild and quickly changing space of graph technology while also trying to modernize healthcare
transitional purposes only
what kind of data do we have
We are using graph paths to calculate a high density of providers with a co-occurrence across payors; we can also find this by plan.
GOAL: infer provider networks across plans – or whichever slice of the data we prefer
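One way to picture the co-occurrence idea is to count, for every pair of providers, how many payors list both of them. This is only a sketch under assumptions: the payer_networks mapping and the provider_cooccurrence function are hypothetical, and the notes do not describe the actual graph traversal used against the HealthGraph. In a graph, the same count corresponds to the number of provider-payor-provider paths between two providers.

```python
from collections import Counter
from itertools import combinations


def provider_cooccurrence(payer_networks):
    """Count how many payors each pair of providers shares.

    payer_networks maps a payor id to the providers seen with that payor.
    """
    counts = Counter()
    for providers in payer_networks.values():
        # Sort so each pair has one canonical ordering; set() drops repeats.
        for pair in combinations(sorted(set(providers)), 2):
            counts[pair] += 1
    return counts


# Made-up data for illustration only.
payer_networks = {
    "payer_a": ["p1", "p2", "p3"],
    "payer_b": ["p1", "p2"],
    "payer_c": ["p2", "p3"],
}
counts = provider_cooccurrence(payer_networks)
```

Pairs with high counts are candidates for belonging to the same inferred network; keying the mapping by plan instead of payor gives the per-plan slice mentioned above.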
we can also answer all sorts of questions
Current healthcare infrastructure is fractured and antiquated… they can’t answer these questions.
4.3 million providers
This is a slide about why
data management:
data engineering: loading of data into a database
data science: probabilistic inferences
updates to transitive dependencies aren’t sexy, but aren’t you glad you don’t have to do this now? Someone had to do it.
There were people on the Titan users group who suggested they had built Titan 0.5 for Hadoop 2.6 themselves, but we could not find any such build publicly available. That is why we released this.
slightly more interesting than dependency whack a mole --
Bulk load of JSON from sequenced HDFS files
We have created a Groovy/Gremlin-based graph DSL for entity retrieval. The DSL is accessible from client scripts in Python or Groovy, or via TinkerPop’s Gremlin Console.