Video of the presentation is available here: https://www.youtube.com/watch?v=bv4fbnvEV94
GraphQL is the Better REST
GraphQL is a data query language developed by Facebook and released under the MIT license. It offers a number of advantages over the traditional REST approach for developing Web services. The talk will describe the REST pain points GraphQL is designed to address and will illustrate GraphQL's basic concepts and the ways to develop GraphQL-based Web services.
Michael Smolyak
As a software engineer, Michael has been fortunate to find a career he truly enjoys. During his 20+ year career he has been a full-stack developer, a technical lead, a software architect, a college instructor, a book editor and a mentor. Regardless of his position or role at work, Michael never ceases to be a student, with interests ranging from software engineering methodologies, to functional reactive programming, to machine learning, to deploying to the cloud. For the last 7 years he has found a good home at Next Century Corporation. In his free time, Michael loves to read and listen to audiobooks and podcasts. He enjoys traveling with his family, running and portrait photography.
Data Journalism at The Baltimore Banner - Data Works MD
Data Works MD February 2023 - https://www.meetup.com/dataworks/events/290813196/
Video -
-------------------------------------------------
Data Journalism at The Baltimore Banner
In this presentation, data journalist Nick Thieme will show what data journalism looks like at The Baltimore Banner. Nick will discuss how data journalism dovetails with local news and will highlight several of the projects they have been working on. Several of The Baltimore Banner's recent articles featuring data journalism can be found here.
-------------------------------------------------
Nick Thieme creates rigorous data journalism with the goal of exposing and undoing systemic inequities by using the tools of statistics to discover reliable information about Baltimore. He grew up in the D.C. area, moving to Baltimore in the 2010s. After a time creating data journalism for Atlanta at the Atlanta Journal-Constitution, he's excited to return home to use his work to make the city a more equitable place. Nick can be reached on Twitter.
Jolt’s Picks - Machine Learning and Major League Baseball Hit Streaks - Data Works MD
Data Works MD September 2022 - https://www.meetup.com/dataworks/events/288251332/
Video - https://www.youtube.com/watch?v=zzLJrMyCLik
-------------------------------------------------
Jolt's Picks: Using Machine Learning to Predict Major League Baseball Hit Streaks
Can you beat baseball legend Joe DiMaggio’s 56-game hitting streak? The short answer is no, you cannot. But can you play Bet MGM’s “Beat The Streak” and win $5.6 million? Yes, you can! In this presentation, FanGraphs writer and Data Scientist Lucas Kelly will present how he used machine learning to try to predict the hitter most likely to get a hit in a Major League Baseball game each day, leveraging analytics to make more intelligent decisions. You will see how machine learning pipelines, API calls, and out-of-the-box thinking may help win an impossible game.
-------------------------------------------------
Lucas Kelly is a writer for the baseball analytics website FanGraphs and a Data Scientist at the World Wildlife Fund. He is a baseball analytics hobbyist and loves to play and write about fantasy baseball. He uses his data science background and Python coding skills to gain an edge in his fantasy leagues. He has not yet "Beat The Streak". If he had, he would be living on his own personal island somewhere.
Data Works MD July 2021 - https://www.meetup.com/DataWorks/events/278394107/
Video - https://youtu.be/WXA1yX8O3Lc
-------------------------------------------------
Introducing Datawave: Scalable Data Ingest and Query on Apache Accumulo
Out of the box, Accumulo's strengths are difficult to appreciate without first building an application that showcases its capabilities to handle massive amounts of data. Unfortunately, building such an application is non-trivial for many would-be users, which affects Accumulo's adoption.
In this talk, we introduce Datawave, a complete ingest, query, and analytic framework for Accumulo. Datawave, recently open-sourced by the National Security Agency, capitalizes on Accumulo's capabilities, provides an API for working with structured and unstructured data, and boasts a robust, flexible, and scalable backend.
We'll do a deep dive into Datawave's project layout, table structures, and APIs in addition to demonstrating the Datawave quickstart—a tool that makes it incredibly easy to hit the ground running with Accumulo and Datawave without having to develop a complete application.
Datawave - https://code.nsa.gov/datawave/
-------------------------------------------------
Hannah Pellón received her B.S. in Mathematics from the University of Maryland while working as a software engineering intern at Northrop Grumman conducting RF signal analysis and spectrometry. She then spent 11 years at Northrop Grumman contributing to IR&D efforts and programs centered around Accumulo and Hadoop. She is currently a software developer and lead at Tiber Technologies focusing on Datawave and distributed computing technologies.
Malware Detection, Enabled by Machine Learning - Data Works MD
Data Works MD January 2021 - https://www.meetup.com/DataWorks/events/274890810/
Video - https://www.youtube.com/watch?v=Bwy9aPOdDPE
----------------------------------------
Malware Detection, Enabled by Machine Learning
With the volume of new malware created each year growing, and the market opportunities for malware reuse expanding, protecting systems can’t rely solely on downloading a vendor’s updated virus signature files. Our customers need ways to detect and cordon likely threats by using data retrieved from a combination of static and behavioral characteristics and comparing it to other classes of “good” versus “bad” files. Optimally, the solution cordons risky files, force ranks them according to their likelihood of causing harm, correlates some metadata to help with further learning and to provide context to analysts, and lets an analyst “release” a file after further analysis and a request from a user. That feedback is then relayed back into the model to support further tuning.
This talk will delve into the IRAD efforts ClearEdge is undertaking to build and integrate malware detectors using machine learning algorithms.
----------------------------------------
Tina Coleman is a Technical Director for ClearEdge. In that role, she’s accountable for furthering the company’s depth in cybersecurity, particularly in aspects that allow ClearEdge to build solutions that scale for customer needs using its strengths in software engineering, dev ops, and data science. In addition to her work on contract and as a Technical Director, Ms. Coleman leads the Women In Technology program for ClearEdge, which seeks to encourage the participation and retention of women in technology. Ms. Coleman graduated from UMBC with undergraduate degrees in Computer Science and Economics and is currently pursuing her master's in Cybersecurity Technology from University of Maryland Global Campus. Tina can be found on LinkedIn at https://www.linkedin.com/in/tinadcoleman/
Using AWS, Terraform, and Ansible to Automate Splunk at Scale - Data Works MD
The DreamPort Splunk Project: How We Use AWS, Terraform, and Ansible to Automate Everything About a Splunk Cluster
At DreamPort, we use cloud platforms, infrastructure-as-code tooling, configuration tools, automation software, and container technologies to very quickly design, develop, and prototype projects. This talk focuses on the tools used to deploy and configure a Splunk cluster for a project we recently ran. We will cover the deployment, configuration, and orchestration of a large 16-node Splunk cluster using tools that are core to DreamPort's cloud infrastructure toolbox: AWS, Terraform, Ansible, and Docker.
It is recommended that attendees have a general understanding of AWS, Linux, Splunk, and Docker, and know about automation tools such as Terraform and Ansible.
Attendees will learn how to use AWS, Terraform, Ansible, and Docker to deploy a large Splunk cluster, and how to use Ansible to orchestrate and manage that cluster.
-------------------------------------------------
Bill Cawthra is a Principal Cloud Infrastructure Architect for CyberPoint, managing project-related cloud systems and platforms. He works primarily on the AWS platform, using various automation tools to rapidly deploy and manage infrastructure. Bill has over 18 years of experience in computers and technology, working in a range of fields, including construction, DoD, health care, and social media.
Data Works MD April 2020 - https://www.meetup.com/DataWorks/events/269772382/
----------------------------------------
Video available at https://www.youtube.com/watch?v=RTy176hpr8Q
----------------------------------------
A Day in the Life of a Data Journalist
Despite gaining prominence in recent years, “data journalism” is still a confusing term for many people. What does it mean to crunch numbers for the news? How does data journalism differ from data science and statistics, and where are the intersections?
Come hear all about the world of news nerdery from your friendly neighborhood data journalist.
----------------------------------------
Christine Zhang just joined the Financial Times as a data journalist on the US elections team for 2020. Previously, she was a data journalist at The Baltimore Sun, where she used numbers, statistics and graphics to tell local news stories on a variety of topics, including police overtime, homicide patterns, population demographics, local and statewide politics — and even made a series of plots visualizing the impressive performance of Ravens quarterback Lamar Jackson. Prior to joining The Sun in 2018, she worked at Two Sigma in New York City, the Los Angeles Times in Los Angeles and the Brookings Institution in Washington, D.C. She has a B.A. from Smith College and an M.A. from Columbia University.
Christine's bylines: https://underthecurve.github.io/bylines/
Christine's LinkedIn: https://www.linkedin.com/in/christineyzhang/
Christine's author page: https://www.baltimoresun.com/bal-christine-zhang-20180802-staff.html
Robotics and Machine Learning: Working with NVIDIA Jetson Kits - Data Works MD
Data Works MD December 2019 - https://www.meetup.com/DataWorks/events/265823739/
Video is available at https://www.youtube.com/watch?v=EFHUdKTDRZM
Robotics and Machine Learning: Working with NVIDIA Jetson Kits
Interested in machine learning and AI? Do you want to learn more about high-performance GPU programming and how it applies to Deep Learning? Patty Delafuente will introduce you to the NVIDIA Jetson developer kits, discuss their applications, explain how to get started, and provide a live demonstration of NVIDIA® Jetson Nano™, an easy-to-use deep learning and robotics platform.
Patty Delafuente
Patty Delafuente is a lead Data Scientist in the Advanced Data Analytics Lab at the Social Security Administration. She teaches evening classes for the University of Maryland Baltimore County’s graduate-level Data Science Program. Patty is a member of the ‘MS in Analytics Advisory Board’ for Texas A&M University.
Patty holds a Master of Science in Analytics from Texas A&M University, along with bachelor's and master's degrees in Information Systems and numerous certifications. In 2017, she was awarded the Texas A&M Margaret Sheather Memorial Award in Analytics for her Capstone Project, “Using Decision Trees to Analyze Patterns in Disability Fraud.”
She has over twenty years of database engineering, business intelligence, and analytics experience.
Her interests include machine learning, text mining, and using GPUs to improve the performance of analyzing and processing big data. She is a certified NVIDIA Instructor in ‘Fundamentals of Deep Learning for Computer Vision’ and ‘Accelerated Computing with Python’. Patty can be reached on LinkedIn at https://www.linkedin.com/in/pattydelafuente319/
Connect Data and Devices with Apache NiFi - Data Works MD
Data Works MD November 2019 - https://www.meetup.com/DataWorks/events/265433970/
Video is available at https://youtu.be/JklA7FNUVhY
Connect Data and Devices with Apache NiFi
Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. It comes with a wonderful management UI, a large marketplace of standard Processors, and a great open source community behind it. This session will show you how to move data across servers and networks, and how to manipulate, enrich, and stream data through custom enrichment processors.
The talk is designed to walk you through the NiFi basics while showing practical examples you can follow along with. The examples will show how to perform data manipulation using a custom Java processor, the ExecuteScript processor with JavaScript and Python, and the JoltTransformData processor. Open-source tools such as Jolt, jQ, and JsonPath will be demonstrated. Finally, it will show how you could prototype a REST service with Standard Processors! There will even be a light bulb flashing in response to things happening in NiFi.
Ryan Hendrickson is a Senior Software Engineer and Director of Innovation who joined Clarity Business Solutions in 2015. He is a Software Project Co-Lead, participates in the Maryland Data Works Meetup, attended OSCON 2018, presented at CodeMash 2019, and enjoys working on his 1966 MGB.
Bill Farmer is VP of Engineering and senior software engineer at Clarity Business Solutions where he is responsible for bringing innovative opportunities to solve some of the customer’s hardest data problems. Bill has over twenty years of experience building data processing and visualization systems across a variety of domains including finance, transportation, and government.
Elli Schwarz is a Senior Software Engineer at Clarity Business Solutions. He has 15 years of experience developing Java applications, creating custom data processing solutions, and applying specialized data models and ontologies to facilitate data exchange. An Apache NiFi enthusiast, he enjoys using NiFi to perform complex ETL tasks for his clients.
Data Works MD September 2019 - https://www.meetup.com/DataWorks/events/264711404/
Video is available at https://www.youtube.com/watch?v=Y3b4Cnnilfw
Introduction to Machine Learning
Machine Learning continues its rise in everyday vernacular and is used for everything from automating mundane tasks to offering intelligent insights across many industries. You may already be using a device that utilizes it, for example a wearable fitness tracker like Fitbit or an intelligent home assistant like Google Home. But there are many more examples of ML in use:
• Predictive Analysis
• Image Recognition
• Speech Recognition
• Medical Diagnoses
• Cyber Security
This session will cover an introduction to Machine Learning, including data modeling, supervised/unsupervised learning and visualizations.
Stephen Scarbrough, CISSP, C|EH
Stephen joined the US Navy in 1990 and retired after 20 years as a CTNC(SW/AW/NAC). His early career was as a Tactical Communications operator onboard surface ships and aircraft. He began working in Network Administration and Network Security in late 1998. In 2005, he joined the NSA/CSS Red/Blue Team for several years. In 2010 he retired and joined IntelliGenesis LLC, where he is a Senior SIGINT Development Analyst and currently the lead contractor for the National Cryptologic School's DATA Curriculum, which includes Data Science and Advanced Analytics Tradecraft mentoring.
Data in the City: Analytics and Civic Data in Baltimore - Data Works MD
Data Works MD August 2019 - https://www.meetup.com/DataWorks/events/263516699/
Data in the City: Analytics and Civic Data in Baltimore
Does Baltimore City government even know how to use data? Why is [insert service here] still paper-based? Where are the hotspots for illegal dumping? Is President Trump right about the rats?
Smart cities, civic data use, and urban data analytics are all hot topics, but what are the current capabilities and applications in our own city government? Justin Elszasz and Babila Lima from the City of Baltimore will showcase a few examples of how data is being used to improve city services for the residents of Baltimore, from simple performance management to predictive analytics.
Justin Elszasz
Justin served as the data scientist for the Bloomberg Philanthropies-funded Innovation Team in the Mayor’s Office before taking on his current role, where he manages the CitiStat program and leads analysis across CitiStat and the Innovation Team. With his previous organization, Navigant Consulting, Justin supported the U.S. Department of Energy’s appliance standards program – the unsung hero of the Obama administration’s climate change efforts – through analysis and developing Federal regulations. He also used data science to improve utility energy efficiency programs.
Justin holds an M.S. in mechanical engineering from Columbia University, where he researched applied data science in the energy sector and was a National Science Foundation fellow in the “Integrative Graduate Education and Research Traineeship”, with a program focus of “Solving Urbanization Challenges by Design”. He has prior experience as a design engineer in the aerospace and medical device industries. Justin can be reached at https://www.linkedin.com/in/justinelszasz/
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock... - Data Works MD
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Fortune 500 Company Performance Analysis Using Social Networks
Speaker: Yi-Shan Shir
This presentation focuses on the correlation between the financial performance of Fortune 500 companies and their social media relationships and behavior. The findings from this research can assist in predicting Fortune 500 stock performance based on a number of social network analysis metrics.
Automated Software Requirements Labeling - Data Works MD
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Machine Learning for Requirements Engineering
Speaker: Jon Patton
This project applies a number of machine learning, deep learning, and NLP techniques to solve challenging problems in requirements engineering.
Introduction to Elasticsearch for Business Intelligence and Application Insights - Data Works MD
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Elasticsearch for Business Intelligence and Application Insights
Speaker: Sean Donnelly
Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. In this talk, I’ll discuss the fundamentals of storage and retrieval in Elasticsearch, why we decided to use it for search in our applications, and how you can also leverage it for both business intelligence and application insights.
An Asynchronous Distributed Deep Learning Based Intrusion Detection System fo... - Data Works MD
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: An Asynchronous Distributed Deep Learning Based Intrusion Detection System for IoT Devices
Speaker: Pu Tian
Intrusion Detection Systems (IDS) in IoT devices are crucial for cybersecurity. Existing models may fail due to increased traffic pattern complexity and data complexity. To address these challenges, we propose an asynchronous distributed deep learning based IDS in which only training weights are shared and devices of heterogeneous computing power can train asynchronously. Empirical results on a large network intrusion dataset show that the system achieves high detection accuracy.
RAPIDS – Open GPU-accelerated Data Science - Data Works MD
RAPIDS – Open GPU-accelerated Data Science
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces, making it easy to accelerate the entire data science pipeline, from ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Master's degree in Electrical & Computer Engineering from Georgia Tech and a Bachelor's from Clemson University.
Two Algorithms for Weakly Supervised Denoising of EEG Data - Data Works MD
The document describes two algorithms for weakly supervised denoising of EEG data:
1. An ICA and multi-instance learning solution that uses ICA to decompose EEG signals into components, extracts SAX features from the components, and uses multi-instance learning to classify components as artifacts or not.
2. An asymmetric generative adversarial network solution that is proposed to improve the model by making it online, fully automated, and end-to-end.
The talk discusses challenges in using EEG data like noise and the need for artifact removal algorithms, and provides an overview of related work on artifact removal including ICA-based approaches.
Detecting Lateral Movement with a Compute-Intense Graph Kernel - Data Works MD
Cybersecurity Analytics on a D-Wave Quantum Computer
Effective cybersecurity analysis requires frequent exploration of graphs of many types and sizes, and the computational cost of these analyses can be overwhelming if they are not carefully chosen. After briefly introducing the D-Wave quantum computing system, we describe an analytic for finding “lateral movement” in an enterprise network, i.e., an intruder or insider threat hopping from system to system to gain access to more information. This analytic depends on maximum independent set, an NP-hard graph kernel whose computational cost grows exponentially with the size of the graph and so has not been widely used in cyber analysis. The growing strength of D-Wave’s quantum computers on such NP-hard problems will enable new analytics. We discuss practicalities of the current implementation and implications of this approach.
Steve Reinhardt has built hardware/software systems that deliver new levels of performance usable via conceptually simple interfaces, including Cray Research’s T3E distributed-memory systems, ISC’s Star-P parallel-MATLAB software, and YarcData/Cray’s Urika graph-analytic systems. He now leads D-Wave’s efforts working with customers to map early applications to D-Wave systems.
Predictive Analytics and Neighborhood Health - Data Works MD
After the 2008 recession, Kansas City, MO, experienced waves of unemployment and foreclosures that led many properties to fall into disrepair. Faced with this growing issue during a period of decreased funding, the city’s code enforcement officials were unable to keep up with the workload, creating an enormous backlog and doubling the workload for each inspector. Together with the JHU Center for Government Excellence (GovEx), the city developed an algorithm to predict how long a given violation will take to resolve based on internal and public data that will help inspectors proactively schedule follow-up inspections and connect more serious cases to community programs earlier.
Matt is the Chief Data Scientist at the Johns Hopkins University Center for Government Excellence, where he and his team help governments apply data to performance challenges and improve the quality of life of their constituents. Prior to joining GovEx, Matt led the data, GIS, and targeting programs for national and state political campaigns, labor unions, and non-profits as they sought to register, persuade, and motivate voters. He was also the lead GIS analyst for Delaware’s State House of Representatives redistricting project in 2010.
Social Network Analysis Workshop
This talk will be a workshop featuring an overview of basic theory and methods for social network analysis and an introduction to igraph. The first half of the talk will be a discussion of the concepts and the second half will feature code examples and demonstrations.
Igraph is a package in R, Python, and C++ that supports social network analysis and network data visualization.
Ian McCulloh holds joint appointments at Johns Hopkins University as a Parson’s Fellow in the Bloomberg School of Public Health, a Senior Lecturer in the Whiting School of Engineering, and a senior scientist at the Applied Physics Lab. His current research is focused on strategic influence in online networks. His most recent papers have focused on the neuroscience of persuasion and measuring influence in online social media firestorms. He is the author of “Social Network Analysis with Applications” (Wiley: 2013) and “Networks Over Time” (Oxford: forthcoming), and has published 48 peer-reviewed papers, primarily in the area of social network analysis. His current applied work is focused on educating soldiers and marines in advanced methods for open source research and data science leadership.
More information about Dr. Ian McCulloh's work can be found at https://ep.jhu.edu/about-us/faculty-directory/1511-ian-mcculloh
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases - Timothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases, and we will see how they differ from traditional databases, in which cases you need one, and in which you probably don’t. I will also go over similarity search, where you get vectors from, and an example of a vector database architecture, wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI and Vector Database demo you needed for now. If not, there’s a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve and what should I show next? Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr... - Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation, which are based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled to discover high-fidelity digital twins of end-to-end processes from event data.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of May 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro) - Rebecca Bilbro
To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science in the 2010s, Rebecca will share lessons learned the hard way, often from watching data science projects go sideways and learning to fix broken things. Through the lens of these canon events, she'll identify some of the anti-patterns and red flags she's learned to steer around.
Data Works MD September 2019 - https://www.meetup.com/DataWorks/events/264711404/
Video is available at https://www.youtube.com/watch?v=Y3b4Cnnilfw
Introduction to Machine Learning
Machine Learning continues its rise in everyday vernacular and is used for everything from automating mundane tasks to offering intelligent insights across many industries. You may already be using a device that utilizes it, for example a wearable fitness tracker like Fitbit or an intelligent home assistant like Google Home. But there are many more examples of ML in use:
• Predictive Analysis
• Image recognition
• Speech Recognition
• Medical diagnoses
• Cyber Security
This session will cover an introduction to Machine Learning, including data modeling, supervised/unsupervised learning, and visualizations.
Stephen Scarbrough, CISSP, C|EH
Stephen joined the US Navy in 1990 and retired after 20 years as a CTNC(SW/AW/NAC). His early career was as a Tactical Communications operator onboard surface ships and aircraft. He began working in network administration and network security in late 1998. In 2005, he joined the NSA/CSS Red/Blue Team for several years. In 2010 he retired and joined IntelliGenesis LLC, where he is a Senior SIGINT Development Analyst and currently the lead contractor for the National Cryptologic School's DATA curriculum, which includes Data Science and Advanced Analytics Tradecraft mentoring.
Data in the City: Analytics and Civic Data in Baltimore - Data Works MD
Data Works MD August 2019 - https://www.meetup.com/DataWorks/events/263516699/
Data in the City: Analytics and Civic Data in Baltimore
Does Baltimore City government even know how to use data? Why is [insert service here] still paper-based? Where are the hotspots for illegal dumping? Is President Trump right about the rats?
Smart cities, civic data use, and urban data analytics are all hot topics, but what are the current capabilities and applications in our own city government? Justin Elszasz and Babila Lima from the City of Baltimore will showcase a few examples of how data is being used to improve city services for the residents of Baltimore, from simple performance management to predictive analytics.
Justin Elszasz
Justin served as the data scientist for the Bloomberg Philanthropies-funded Innovation Team in the Mayor’s Office before taking on his current role, where he manages the CitiStat program and leads analysis across CitiStat and the Innovation Team. With his previous organization, Navigant Consulting, Justin supported the U.S. Department of Energy’s appliance standards program – the unsung hero of the Obama administration’s climate change efforts – through analysis and developing Federal regulations. He also used data science to improve utility energy efficiency programs.
Justin holds an M.S. in mechanical engineering from Columbia University, where he researched applied data science in the energy sector and was a National Science Foundation fellow in the “Integrative Graduate Education and Research Traineeship”, with a program focus of “Solving Urbanization Challenges by Design”. He has prior experience as a design engineer in the aerospace and medical device industries. Justin can be reached at https://www.linkedin.com/in/justinelszasz/
06-18-2024-Princeton Meetup-Introduction to Milvus - Timothy Spann
06-18-2024-Princeton Meetup-Introduction to Milvus
tim.spann@zilliz.com
https://www.linkedin.com/in/timothyspann/
https://x.com/paasdev
https://github.com/tspannhw
https://github.com/milvus-io/milvus
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/142-17June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
3. What is REST?
● REST is the de facto standard for designing Web APIs
● REST (REpresentational State Transfer) is an architectural
style for developing web services.
● REST is popular due to its simplicity and the fact that it
builds upon existing systems and features of the internet's
HTTP.
5. REST Design Principles
● Stateless server
● Uniform interface
● Everything is a resource
● Use HTTP verbs for CRUD actions
● Explore associations through URL structure
● Attribute filters, sorting and pagination through query
parameters
● Implement global and local searches
● Error handling using HTTP error codes
● Use JSON as data exchange format
6. REST APIs have proven
too inflexible to keep
up with the rapidly
changing requirements of
the clients that access
them.
7. REST Drawbacks
● Poor data discovery
● Multiple fetches are common
● Fetching extraneous data
● No query validation
● Does not handle API deprecations, additions
and changes
9. REST Example
● In a blogging application, an app needs to display the
titles of the posts of a specific user.
● The same screen also displays the names of the last 3
followers of that user.
Source: https://www.howtographql.com/basics/1-graphql-is-the-better-rest/
13. Problems with this Approach
● It took three HTTP requests to populate the
page (underfetching).
● Each request potentially returned more data
than was necessary (overfetching)
● New client data needs typically require new
endpoints (endpoint management)
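The request pattern behind these problems can be sketched with a toy in-memory stand-in for the REST backend. This is only an illustration: no real HTTP is involved, and the endpoint paths, field names, and sample data are assumptions rather than part of the talk.

```python
# Toy stand-in for a REST backend. Paths, fields and data are hypothetical.
USERS = {
    "u1": {"id": "u1", "name": "Mary", "address": "1 Main St", "birthday": "1990-07-04"},
}
POSTS = {
    "u1": [
        {"id": "p1", "title": "Learn GraphQL today", "content": "...", "comments": []},
        {"id": "p2", "title": "REST in peace", "content": "...", "comments": []},
    ],
}
FOLLOWERS = {
    "u1": [
        {"id": "u2", "name": "John", "address": "...", "birthday": "..."},
        {"id": "u3", "name": "Alice", "address": "...", "birthday": "..."},
        {"id": "u4", "name": "Sarah", "address": "...", "birthday": "..."},
        {"id": "u5", "name": "Tom", "address": "...", "birthday": "..."},
    ],
}

request_log = []  # every simulated GET is recorded here


def rest_get(path):
    """Simulate GET <path>; a fixed endpoint always returns the whole resource."""
    request_log.append(path)
    parts = path.strip("/").split("/")
    user_id = parts[1]
    if len(parts) == 2:
        return USERS[user_id]
    if parts[2] == "posts":
        return POSTS[user_id]
    if parts[2] == "followers":
        return FOLLOWERS[user_id]
    raise KeyError(path)


def build_screen(user_id):
    """Collect what the screen shows: post titles and the last 3 follower names."""
    user = rest_get(f"/users/{user_id}")
    posts = rest_get(f"/users/{user_id}/posts")
    followers = rest_get(f"/users/{user_id}/followers")
    return {
        "name": user["name"],
        "post_titles": [p["title"] for p in posts],
        "last_followers": [f["name"] for f in followers[-3:]],
        # Fields that were transferred for the user but never displayed.
        "unused_user_fields": sorted(set(user) - {"name"}),
    }


screen = build_screen("u1")
print(len(request_log), "requests:", request_log)
print(screen)
```

In this sketch the screen costs three round trips, and every response carries fields the screen never shows, such as `address` and `birthday`.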
15. Why is GitHub using GraphQL?
“GitHub chose GraphQL for our API v4 because it offers
significantly more flexibility for our integrators. The ability to
define precisely the data you want—and only the data you
want—is a powerful advantage over the REST API v3
endpoints. GraphQL lets you replace multiple REST requests
with a single call to fetch the data you specify.”
Source: https://developer.github.com/v4/
16. GraphQL at Netflix
“Since GraphQL allows the client to select only the data it
needs we end up fetching a significantly smaller payload. In
our application, pages that were fetching 10MB of data before
now receive about 200KB. Page loads became much faster,
especially over data-constrained mobile networks, and our
app uses much less memory.”
Source: https://medium.com/netflix-techblog/our-learnings-from-adopting-graphql-f099de39ae5f
17. GraphQL and REST
● Although GraphQL addresses many shortcomings of
REST, they can be easily used together.
● GraphQL API can be used as a facade with its resolvers
obtaining data using existing REST services.
● This is one way for an organization to incrementally adopt
GraphQL
Source: https://medium.com/netflix-techblog/our-learnings-from-adopting-graphql-f099de39ae5f
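A facade of this kind might look roughly like the resolver map below. It is a sketch, not the approach of any particular organization: the `user` and `posts` fields, the REST paths and the `restGet` helper on the context are all hypothetical, and the helper is stubbed so the example runs without a network.

```javascript
// Resolver map for a hypothetical GraphQL facade. Each resolver delegates
// to an existing REST service through `context.restGet` (an assumed helper).
const resolvers = {
  Query: {
    user: (root, { id }, context) => context.restGet(`/users/${id}`),
  },
  User: {
    posts: (user, args, context) => context.restGet(`/users/${user.id}/posts`),
  },
};

// Stub standing in for the REST service. A real helper would perform an
// HTTP GET and return a promise, which resolvers are allowed to return.
const calls = [];
const context = {
  restGet: (path) => {
    calls.push(path);
    return path.endsWith("/posts")
      ? [{ title: "Learn GraphQL today" }]
      : { id: "1", name: "Mary" };
  },
};

// A GraphQL server would make these calls while executing
// { user(id: "1") { name posts { title } } }.
const user = resolvers.Query.user(null, { id: "1" }, context);
const posts = resolvers.User.posts(user, {}, context);
console.log(calls, user.name, posts.length);
```

Because the REST services stay untouched, clients can move to the GraphQL endpoint one screen at a time.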
19. What is GraphQL
● GraphQL is a query language for your API, and a
server-side runtime for executing queries by
using a type system you define for your data.
● GraphQL isn't tied to any specific database or
storage engine and is instead backed by your
existing code and data.
20. ‘Graph’ in GraphQL
● You model your business domain as a graph by
defining a schema.
● Within your schema, you define different types of
nodes and how they connect/relate to one another.
22. GraphQL Core Concepts
● The Schema Definition Language (SDL)
○ Queries to Fetch Data
○ Mutations to Update Data
○ Subscriptions for Real-Time Updates
● Resolvers to implement Queries, Mutations and
Subscriptions
23. GraphQL: Structure vs. Behavior
● Separating interface from implementation
● Interface (structure) is defined by the schema
● Implementation (behavior) is encapsulated in
resolvers.
27. What is a Schema?
● The GraphQL schema defines the server’s API
● The GraphQL schema provides a clear contract
for client-server communication
● GraphQL schemas are language-agnostic
● The main components of a schema definition are
the types and their fields
28. Example of Types and Fields
type Post {
id: String!
title: String!
publishedAt: DateTime!
likes: Int! @default(value: 0)
blog: Blog @relation(name: "Posts")
}
type Blog {
id: String!
name: String!
description: String
posts: [Post!]! @relation(name: "Posts")
}
29. GraphQL Schema Types
● Object types represent a kind of object you can fetch from your
service, and what fields it has
● A type has a name and can implement one or more interfaces
● A field has a name and a type
● GraphQL supports built-in scalar types
● An enum is a scalar value that has a specified set of possible
values
● An Interface is an abstract type that includes a certain set of
fields that a type must include to implement the interface
● Custom types can be created using scalar types, enums, other
custom types and interfaces.
30. Anatomy of GraphQL Custom Type
type Character {
name: String!
appearsIn: [Episode]!
}
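● Type name: Character
● Fields: name, appearsIn
● Field of scalar type: name: String!
● Field of array type: appearsIn: [Episode]!
● Non-nullable field: marked with a trailing !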
31. Built-in Scalar Types
● Int: A signed 32‐bit integer.
● Float: A signed double-precision floating-point value.
● String: A UTF‐8 character sequence.
● Boolean: true or false.
● ID: The ID scalar type represents a unique identifier
32. Enumeration Types
● Special kind of scalar that is restricted to a particular set of allowed values
● Validates that any arguments of this type are one of the allowed values
● Communicates through the type system that a field will always be one of a
finite set of values
enum Episode {
NEWHOPE
EMPIRE
JEDI
}
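The validation described above can be imitated in plain JavaScript. This is a rough stand-in for what a GraphQL server does with enum arguments, not its actual implementation; the function name `assertEpisode` is made up for the sketch.

```javascript
// Allowed values, mirroring the `Episode` enum on the slide.
const Episode = Object.freeze(["NEWHOPE", "EMPIRE", "JEDI"]);

// Rejects any value outside the finite set, as a GraphQL server would
// reject an argument that is not one of the enum's values.
function assertEpisode(value) {
  if (!Episode.includes(value)) {
    throw new TypeError(`"${value}" is not a valid Episode`);
  }
  return value;
}

console.log(assertEpisode("JEDI"));
```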
33. Interfaces
interface Character {
id: ID!
name: String!
friends: [Character]
appearsIn: [Episode]!
}
type Human implements Character {
id: ID!
name: String!
friends: [Character]
appearsIn: [Episode]!
starships: [Starship]
totalCredits: Int
}
type Droid implements Character {
id: ID!
name: String!
friends: [Character]
appearsIn: [Episode]!
primaryFunction: String
}
35. Query, Mutation and Subscription Types
● In addition to schema types defining the server-side domain
model, GraphQL defines special types, Query, Mutation and
Subscription
● These types define the server API in terms of domain types
● Query type specifies what queries clients can execute
● Mutation type defines create, update and delete operations
● Subscription defines the events the client can receive from the
server
37. GraphQL definition of type Lift
# A `Lift` is a chairlift, gondola, tram, funicular, pulley, rope tow, or other means of ascending a mountain.
type Lift {
# The unique identifier for a `Lift` (id: "panorama")
id: ID!
# The name of a `Lift`
name: String!
# The current status for a `Lift`: `OPEN`, `CLOSED`, `HOLD`
status: LiftStatus
# The number of people that a `Lift` can hold
capacity: Int!
# A boolean describing whether a `Lift` is open for night skiing
night: Boolean!
# The number of feet in elevation that a `Lift` ascends
elevationGain: Int!
# A list of trails that this `Lift` serves
trailAccess: [Trail!]!
}
38. GraphQL definition of type Trail
# A `Trail` is a run at a ski resort
type Trail {
# A unique identifier for a `Trail` (id: 'hemmed-slacks')
id: ID!
# The name of a `Trail`
name: String!
# The current status for a `Trail`: OPEN, CLOSED
status: TrailStatus
# The difficulty rating for a `Trail`
difficulty: String!
# A boolean describing whether or not a `Trail` is groomed
groomed: Boolean!
# A boolean describing whether or not a `Trail` has trees
trees: Boolean!
# A boolean describing whether or not a `Trail` is open for night skiing
night: Boolean!
# A list of Lifts that provide access to this `Trail`
accessedByLifts: [Lift!]!
}
39. GraphQL definition of enums LiftStatus and TrailStatus
# An enum describing the options for `LiftStatus`: `OPEN`, `CLOSED`, `HOLD`
enum LiftStatus {
OPEN
CLOSED
HOLD
}
# An enum describing the options for `TrailStatus`: `OPEN`, `CLOSED`
enum TrailStatus {
OPEN
CLOSED
}
40. GraphQL definition of API Queries against Lifts and Trails
type Query {
# A list of all `Lift` objects
allLifts(status: LiftStatus): [Lift!]!
# A list of all `Trail` objects
allTrails(status: TrailStatus): [Trail!]!
# Returns a `Lift` by `id` (id: "panorama")
Lift(id: ID!): Lift!
# Returns a `Trail` by `id` (id: "old-witch")
Trail(id: ID!): Trail!
# Returns an `Int` of `Lift` objects by `LiftStatus`
liftCount(status: LiftStatus!): Int!
# Returns an `Int` of `Trail` objects by `TrailStatus`
trailCount(status: TrailStatus!): Int!
}
41. GraphQL definition of Mutations and Subscriptions
type Mutation {
"""
Sets a `Lift` status by sending `id` and `status`
"""
setLiftStatus(id: ID!, status: LiftStatus!): Lift!
"""
Sets a `Trail` status by sending `id` and `status`
"""
setTrailStatus(id: ID!, status: TrailStatus!): Trail!
}
type Subscription {
liftStatusChange: Lift
trailStatusChange: Trail
}
44. GraphQL Queries vs. REST Queries
● REST servers define multiple query endpoints,
each with a predefined data structure
● GraphQL servers provide a single query
endpoint with a flexible query structure
● Query structure in GraphQL is defined in terms
of Schema Types discussed previously
45. A simple GraphQL query and its result
{
hero {
name
}
}
{
"data": {
"hero": {
"name": "R2-D2"
}
}
}
46. A multi-level GraphQL query
{
hero {
name
# Queries can have comments!
friends {
name
}
}
}
{
"data": {
"hero": {
"name": "R2-D2",
"friends": [
{
"name": "Luke Skywalker"
},
{
"name": "Han Solo"
},
{
"name": "Leia Organa"
}
]
}
}
}
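The way the response mirrors the shape of the query can be illustrated with a small selection function. This is a deliberately simplified sketch: a real GraphQL server invokes a resolver for each field rather than copying properties, and the `select` helper and the extra fields on the `hero` record are assumptions made for the example.

```javascript
// Returns a copy of `source` that contains only the fields named in
// `selection`. `true` selects a leaf field; a nested object selects sub-fields.
function select(source, selection) {
  if (Array.isArray(source)) {
    return source.map((item) => select(item, selection));
  }
  const result = {};
  for (const [field, sub] of Object.entries(selection)) {
    result[field] = sub === true ? source[field] : select(source[field], sub);
  }
  return result;
}

// Made-up server-side record with more fields than the client asks for.
const hero = {
  id: "2001",
  name: "R2-D2",
  primaryFunction: "Astromech",
  friends: [
    { id: "1000", name: "Luke Skywalker" },
    { id: "1002", name: "Han Solo" },
    { id: "1003", name: "Leia Organa" },
  ],
};

// Mirrors the query { hero { name friends { name } } }.
const data = { hero: select(hero, { name: true, friends: { name: true } }) };
console.log(JSON.stringify({ data }, null, 2));
```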
47. A named query with an argument
query GetReturnOfTheJedi {
film(id: "ZmlsbXM6Mw==") {
title
director
releaseDate
}
}
{
"data": {
"film": {
"title": "Return of the Jedi",
"director": "Richard Marquand",
"releaseDate": "1983-05-25"
}
}
}
48. A named query with a variable argument
query GetReturnOfTheJedi($id: ID) {
film(id: $id) {
title
director
releaseDate
}
}
{ "id": filmId }
{
"data": {
"film": {
"title": "Return of the Jedi",
"director": "Richard Marquand",
"releaseDate": "1983-05-25"
}
}
}
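Over HTTP, a query and its variables are commonly sent together as a JSON body in a POST request. The sketch below only builds that request; the helper name `buildGraphQLRequest` and the endpoint in the usage comment are assumptions, and the id value is the one used on the previous slide.

```javascript
const query = `
  query GetReturnOfTheJedi($id: ID) {
    film(id: $id) {
      title
      director
      releaseDate
    }
  }
`;

// Builds the options object for a fetch() call. The query text stays
// constant; only the variables change between requests.
function buildGraphQLRequest(query, variables) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, variables }),
  };
}

const request = buildGraphQLRequest(query, { id: "ZmlsbXM6Mw==" });
console.log(request.body);

// Usage, assuming a hypothetical endpoint:
// fetch("https://example.com/graphql", request).then((res) => res.json());
```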
49. A query with a field alias
query GetTitles {
allFilms {
films {
filmTitle: title
}
}
}
{
"data": {
"allFilms": {
"films": [
{
"filmTitle": "A New Hope"
},
{
"filmTitle": "The Empire Strikes Back"
},
{
"filmTitle": "Return of the Jedi"
},
...
50. Using query fragments
query GetFilmInfo {
film1: film(id: "ZmlsbXM6NA==") {
title
director
producers
}
film2: film(id: "ZmlsbXM6Ng==") {
title
director
producers
}
}
query GetFilmInfo {
film1: film(id: "ZmlsbXM6NA==") {
...info
}
film2: film(id: "ZmlsbXM6Ng==") {
...info
}
}
fragment info on Film {
title
director
producers
}
51. @include directive
query GetTitles($includeDirector: Boolean!) {
allFilms {
films {
filmTitle: title
director @include(if: $includeDirector)
}
}
}
{
"data": {
"allFilms": {
"films": [
{
"filmTitle": "A New Hope",
"director": "George Lucas"
},
{
"filmTitle": "The Empire Strikes Back",
"director": "Irvin Kershner"
},
{
"filmTitle": "Return of the Jedi",
"director": "Richard Marquand"
},
...
53. Mutations Defined
● The purpose of mutations is
○ Creating new data
○ Updating existing data
○ Deleting data
● Syntax for mutations is similar to that of queries
○ name
○ arguments
○ return object with its fields
54. Named mutation with parameters
mutation CreateReviewForEpisode($ep: Episode!,
$review: ReviewInput!) {
createReview(episode: $ep, review: $review) {
stars
commentary
}
}
{
"ep": "JEDI",
"review": {
"stars": 5,
"commentary": "This is a great movie!"
}
}
{
"data": {
"createReview": {
"stars": 5,
"commentary": "This is a great movie!"
}
}
}
56. Resolvers Defined
● Resolvers implement the API
● Each field in a GraphQL schema is backed by a
resolver
● Each resolver knows how to fetch the data for
its field.
57. GraphQL definition of API Queries against Lifts and Trails
type Query {
# A list of all `Lift` objects
allLifts(status: LiftStatus): [Lift!]!
# A list of all `Trail` objects
allTrails(status: TrailStatus): [Trail!]!
# Returns a `Lift` by `id` (id: "panorama")
Lift(id: ID!): Lift!
# Returns a `Trail` by `id` (id: "old-witch")
Trail(id: ID!): Trail!
# Returns an `Int` of `Lift` objects by `LiftStatus`
liftCount(status: LiftStatus!): Int!
# Returns an `Int` of `Trail` objects by `TrailStatus`
trailCount(status: TrailStatus!): Int!
}
58. GraphQL resolvers for API Queries against Lifts and Trails
module.exports = {
allLifts: (root, { status }, { lifts }) => {
if (!status) {
return lifts
} else {
var filteredLifts = lifts.filter(lift => lift.status === status)
return filteredLifts
}
},
allTrails: (root, { status }, { trails }) => {
if (!status) {
return trails
} else {
var filteredTrails = trails.filter(trail => trail.status === status)
return filteredTrails
}
},
Lift: (root, { id }, { lifts }) => {
var selectedLift = lifts.filter(lift => id === lift.id)
return selectedLift[0]
},
...
59. Resolver definitions continued
...
Trail: (root, { id }, { trails }) => {
var selectedTrail = trails.filter(trail => id === trail.id)
return selectedTrail[0]
},
liftCount: (root, { status }, { lifts }) => {
// Count the lifts whose status matches the requested one
return lifts.filter(lift => lift.status === status).length
},
trailCount: (root, { status }, { trails }) => {
// Count the trails whose status matches the requested one
return trails.filter(trail => trail.status === status).length
}
}
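Resolvers are ordinary functions, so they can be exercised directly without a server. The sketch below condenses two of the query resolvers and calls them by hand; the lift records are invented sample data, and in a running service the GraphQL server would make these calls itself.

```javascript
// Sample in-memory data; the lift records are invented for illustration.
const context = {
  lifts: [
    { id: "panorama", name: "Panorama", status: "OPEN" },
    { id: "summit", name: "Summit", status: "CLOSED" },
    { id: "jazz-cat", name: "Jazz Cat", status: "OPEN" },
  ],
};

// Two of the query resolvers from the slides, condensed.
const Query = {
  allLifts: (root, { status }, { lifts }) =>
    status ? lifts.filter((lift) => lift.status === status) : lifts,
  liftCount: (root, { status }, { lifts }) =>
    lifts.filter((lift) => lift.status === status).length,
};

// Roughly what happens for a query such as
// { allLifts(status: OPEN) { name } liftCount(status: OPEN) }.
const openLifts = Query.allLifts(null, { status: "OPEN" }, context);
const openCount = Query.liftCount(null, { status: "OPEN" }, context);
console.log(openLifts.map((lift) => lift.name), openCount);
```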
60. Resolver Parameters
Resolvers get four parameters:
● Root - the result of the parent resolver's call (for top-level fields,
the rootValue from the server configuration)
● Args - carries the parameters for the query
● Context - an object that gets passed through the resolver chain that each
resolver can write to and read from
● Info - an AST representation of the query or mutation
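A resolver that touches all four parameters might look like the sketch below. The `upperCase` argument, the `viewer` and `log` entries on the context and the trail record are assumptions made for the example; `info` is reduced to its `fieldName` property, while the object a server passes carries much more.

```javascript
// Resolves a `name` field using root, args, context and info.
function nameResolver(root, args, context, info) {
  // root: the parent object; args: the field's arguments
  const name = args.upperCase ? root.name.toUpperCase() : root.name;
  // context: shared state; info: details about the field being resolved
  context.log.push(`${context.viewer} read ${info.fieldName}`);
  return name;
}

// Hand-built arguments; a GraphQL server would supply these itself.
const context = { viewer: "anonymous", log: [] };
const info = { fieldName: "name" };
const result = nameResolver({ name: "Old Witch" }, { upperCase: true }, context, info);
console.log(result, context.log);
```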
61. GraphQL definition of type Lift
# A `Lift` is a chairlift, gondola, tram, funicular, pulley, rope tow, or other means of ascending a mountain.
type Lift {
# The unique identifier for a `Lift` (id: "panorama")
id: ID!
# The name of a `Lift`
name: String!
# The current status for a `Lift`: `OPEN`, `CLOSED`, `HOLD`
status: LiftStatus
# The number of people that a `Lift` can hold
capacity: Int!
# A boolean describing whether a `Lift` is open for night skiing
night: Boolean!
# The number of feet in elevation that a `Lift` ascends
elevationGain: Int!
# A list of trails that this `Lift` serves
trailAccess: [Trail!]!
}
62. Resolver for the trailAccess field of the Lift type
module.exports = {
trailAccess: (root, args, { trails }) => root.trails
.map(id => trails.find(t => id === t.id))
.filter(x => x)
}
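The resolver above assumes that each stored lift keeps a list of trail ids in `root.trails`. A short run with invented sample data shows how those ids turn into trail objects and how an id with no matching trail is dropped.

```javascript
// Resolver from the slide: maps the trail ids stored on a lift record to
// full trail objects, dropping ids that match no trail.
const Lift = {
  trailAccess: (root, args, { trails }) =>
    root.trails
      .map((id) => trails.find((t) => id === t.id))
      .filter((x) => x),
};

// Invented sample data: the lift stores trail ids, not trail objects.
const context = {
  trails: [
    { id: "old-witch", name: "Old Witch" },
    { id: "hemmed-slacks", name: "Hemmed Slacks" },
  ],
};
const lift = { id: "panorama", trails: ["hemmed-slacks", "no-such-trail"] };

const served = Lift.trailAccess(lift, {}, context);
console.log(served);
```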
63. GraphQL definition of Mutations and Subscriptions
type Mutation {
"""
Sets a `Lift` status by sending `id` and `status`
"""
setLiftStatus(id: ID!, status: LiftStatus!): Lift!
"""
Sets a `Trail` status by sending `id` and `status`
"""
setTrailStatus(id: ID!, status: TrailStatus!): Trail!
}
type Subscription {
liftStatusChange: Lift
trailStatusChange: Trail
}
64. Resolvers for Mutations
module.exports = {
setLiftStatus: (root, { id, status }, { lifts, pubsub }) => {
var updatedLift = lifts.find(lift => id === lift.id)
updatedLift.status = status
pubsub.publish('lift-status-change', { liftStatusChange: updatedLift })
return updatedLift
},
setTrailStatus: (root, { id, status }, { trails, pubsub }) => {
var updatedTrail = trails.find(trail => id === trail.id)
updatedTrail.status = status
pubsub.publish('trail-status-change', { trailStatusChange: updatedTrail })
return updatedTrail
}
}
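The mutation resolvers can be tried out with a stub in place of the real pub/sub object. In this sketch the `pubsub` stub only records what was published, the lift record is invented, and a guard for unknown ids has been added that the slide version does not have.

```javascript
// Minimal stand-in for a pub/sub object; only `publish` is modelled here.
const published = [];
const pubsub = { publish: (event, payload) => published.push({ event, payload }) };

const context = {
  lifts: [{ id: "panorama", name: "Panorama", status: "OPEN" }],
  pubsub,
};

// Mutation resolver from the slide, plus a guard for unknown ids.
const setLiftStatus = (root, { id, status }, { lifts, pubsub }) => {
  const updatedLift = lifts.find((lift) => id === lift.id);
  if (!updatedLift) {
    throw new Error(`No lift with id "${id}"`);
  }
  updatedLift.status = status;
  // Subscribers to liftStatusChange are notified through this event.
  pubsub.publish("lift-status-change", { liftStatusChange: updatedLift });
  return updatedLift;
};

const result = setLiftStatus(null, { id: "panorama", status: "HOLD" }, context);
console.log(result.status, published.length);
```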
67. GRANDstack Technologies
● GraphQL – a query language for APIs and a runtime for
fulfilling those queries with your existing data
● React – a JavaScript library for building user interfaces
● Apollo Client – a fully-featured, production-ready
caching GraphQL client for every server or UI framework
● Neo4j Database – a graph database that is
ACID-compliant and built to store and retrieve
connected data
69. GraphQL APIs have a strongly typed schema
A GraphQL schema is the backbone of every
GraphQL API. It clearly defines the operations
(queries, mutations and subscriptions)
supported by the API, including input arguments
and possible responses. The schema is an
unfailing contract that specifies the capabilities
of an API.
70. GraphQL APIs have a strongly typed schema
Developers don’t have to manually write API
documentation any more — instead it can be
auto-generated based on the schema that
defines the API
71. No more overfetching and underfetching
Clients can retrieve exactly the data they need
from the API. They don’t have to rely on REST
endpoints that return predefined and fixed data
structures. Instead, the client can dictate the
shape of the response objects returned by the
API.
72. GraphQL enables rapid product development
Thanks to GraphQL client libraries (like Apollo,
Relay or urql), frontend developers get
features like caching, realtime or optimistic UI
updates basically for free.
73. GraphQL enables rapid product development
● Schema-driven development is a process where
a feature is first defined in the schema, then
implemented with resolver functions.
● Tools like GraphQL Faker mock the entire
GraphQL API (based on its schema definition),
so frontend and backend teams can work
completely independently.
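The idea behind schema-based mocking can be sketched without any tooling: once the field types are known, placeholder values can be produced for each of them. This is not how GraphQL Faker works internally; the `liftFields` description is written by hand here, and `mockByType` and `mockObject` are hypothetical names.

```javascript
// Hand-written description of some `Lift` fields; a real tool would read
// this from the schema instead.
const liftFields = { id: "ID", name: "String", capacity: "Int", night: "Boolean" };

// One placeholder generator per built-in scalar type used above.
const mockByType = {
  ID: () => "mock-id",
  String: () => "mock string",
  Int: () => 42,
  Boolean: () => true,
};

// Builds an object with a placeholder value for every described field.
function mockObject(fields) {
  return Object.fromEntries(
    Object.entries(fields).map(([field, type]) => [field, mockByType[type]()])
  );
}

const mockLift = mockObject(liftFields);
console.log(mockLift);
```

A frontend team could develop against objects like `mockLift` while the real resolvers are still being written.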
74. Support for multiple server-side languages
● A GraphQL server can be implemented in any
programming language that can be used to build
a web server.
● Besides JavaScript, there are popular reference
implementations for Ruby, Python, Scala, Java,
Clojure, Go and .NET.
75. Composing GraphQL APIs
● Schema stitching is combining and connecting
multiple GraphQL schemas (or schema
definitions) to create a single GraphQL API.
● Thanks to schema stitching, clients only deal
with a single API endpoint and all complexity of
orchestrating the communication with the
various services is hidden from the client.
76. Rich open-source ecosystem
● When GraphQL came out, the only tooling available
to developers was the graphql-js
reference implementation, a piece of
middleware for Express.js, and the GraphQL
client Relay (Classic).
77. Rich open-source ecosystem
GraphQL Clients
● FetchQL
● GraphQL Request
● Apollo Fetch
● Lokka
● Micro GraphQL React
● URQL
● Apollo Client
● Relay Modern
78. Rich open-source ecosystem
GraphQL Tools
● Prisma - Simplified Database Access
● GraphQL Faker - Mock your future API or
extend the existing API with realistic data
● GraphQL Playground - GraphQL IDE
● graphql-config - IDE configuration