Aboli Khairnar, Data Scientist, Citi Ventures Inc.
The majority of traditional corporate valuation methods are based solely on tangible indicators (sales growth, gross profit, cash flow, and operational performance), which are outcomes and don't reflect the underlying process that makes an organization successful in the long run. One of the most important intangible assets that don't appear directly on the balance sheet is human capital. In this research, we focus on the role that workforce skills, education, and knowledge play in organizational success using state-of-the-art Machine Learning techniques. We find that investments in human capital not only play a crucial role in organizational growth but also have a causal relationship with it.
Data Con LA 2022 - Using Google trends data to build product recommendations
Mike Limcaco, Analytics Specialist / Customer Engineer at Google
Measure trends in a particular topic or search term across Google Search in the US, down to the city level. Integrate these data signals into analytic pipelines to drive product, retail, and media (video, audio, digital content) recommendations tailored to your audience segment. We'll discuss how Google's unique datasets can be used with Google Cloud smart analytic services to process, enrich and surface the most relevant product or content that matches the ever-changing interests of your local customer segment.
Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day work of building, productionizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - Improving disaster response with machine learning
Antje Barth, Principal Developer Advocate, AI/ML at AWS & Chris Fregly, Principal Engineer, AI & ML at AWS
The frequency and severity of natural disasters are increasing. In response, governments, businesses, nonprofits, and international organizations are placing more emphasis on disaster preparedness and response. Many organizations are accelerating their efforts to make their data publicly available for others to use. Repositories such as the Registry of Open Data on AWS and Humanitarian Data Exchange contain troves of data available for use by developers, data scientists, and machine learning practitioners. In this session, see how a community of developers came together through the AWS Disaster Response hackathon to build models to support natural disaster preparedness and response.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Data Con LA 2022 - Real world consumer segmentation
Jaysen Gillespie, Head of Analytics and Data Science at RTB House
1. Shopkick has over 30M downloads, but the userbase is very heterogeneous. Anecdotal evidence indicated a wide variety of users for whom the app holds long-term appeal.
2. Marketing and other teams challenged Analytics to get beyond basic summary statistics and develop a holistic segmentation of the userbase.
3. Shopkick's data science team used SQL and Python to gather data, clean data, and then perform a data-driven segmentation using a k-means algorithm.
4. Interpreting the results is more work -- and more fun -- than running the algo itself. We'll discuss how we transform from "segment 1", "segment 2", etc. to something that non-analytics users (Marketing, Operations, etc.) could actually benefit from.
5. So what? How did teams across Shopkick change their approach given what Analytics had discovered?
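The k-means step mentioned in the list above can be sketched as follows. This is a minimal illustration of the general algorithm in plain Python, not Shopkick's actual pipeline: the two features (sessions per week, rewards redeemed per month), the sample values, and the `kmeans` helper are all assumptions made up for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical per-user features: (sessions per week, rewards redeemed per month).
users = [(1, 0), (2, 1), (1, 1), (20, 8), (22, 9), (19, 7)]
labels, centers = kmeans(users, k=2)
```

The output is only a cluster index per user. Turning "cluster 0" and "cluster 1" into names that Marketing or Operations can act on is the interpretation work the talk describes, and it starts from inspecting `centers`.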
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Ravi Pillala, Chief Data Architect & Distinguished Engineer at Intuit
TurboTax is one of the well-known consumer software brands, serving 385K+ concurrent users at its peak. In this session, we start by looking at how user behavioral data and tax domain events are captured in real time using the event bus and analyzed to drive real-time personalization with various TurboTax data pipelines. We will also look at solutions that perform analytics on these events with the help of Kafka, Apache Flink, Apache Beam, Spark, Amazon S3, Amazon EMR, Redshift, Athena, and AWS Lambda functions. Finally, we look at how SageMaker is used to create the TurboTax model that predicts whether a customer is at risk or needs help.
Data Con LA 2022 - Moving Data at Scale to AWS
George Mansoor, Chief Information Systems Officer at California State University
Overview of the CSU Data Architecture for moving on-prem ERP data to the AWS Cloud at scale, using Delphix for data replication/virtualization and AWS Database Migration Service (DMS) for data extracts.
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Anand Ranganathan, Chief AI Officer at Unscrambl
Conversational AI is getting more and more widely used for customer support and employee support use-cases. In this session, I'm going to talk about how it can be extended for data analysis and data science use-cases ... i.e., how users can interact with a bot to ask analytical questions on data in relational databases.
This allows users to explore complex datasets using a combination of text and voice questions, in natural language, and then get back results in a combination of natural language and visualizations. Furthermore, it allows collaborative exploration of data by a group of users in a channel in platforms like Microsoft Teams, Slack or Google Chat.
For example, a group of users in a channel can ask questions to a bot in plain English like "How many cases of Covid were there in the last 2 months by state and gender" or "Why did the number of deaths from Covid increase in May 2022", and jointly look at the results that come back. This facilitates data awareness, data-driven collaboration and joint decision making among teams in enterprises and outside.
In this talk, I'll describe how we can bring together various features including natural-language understanding, NL-to-SQL translation, dialog management, data story-telling, semantic modeling of data and augmented analytics to facilitate collaborative exploration of data using conversational AI.
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Anil Inamdar, VP & Head of Data Solutions at Instaclustr
The most modernized enterprises utilize polyglot architecture, applying the best-suited database technologies to each of their organization's particular use cases. To successfully implement such an architecture, though, you need a thorough knowledge of the expansive NoSQL data technologies now available.
Attendees of this Data Con LA presentation will come away with:
-- A solid understanding of the decision-making process that should go into vetting NoSQL technologies and how to plan out their data modernization initiatives and migrations.
-- Knowledge of the types of functionality that best match the strengths of NoSQL key-value stores, graph databases, columnar databases, document-type databases, time-series databases, and more.
-- An understanding of how to navigate database technology licensing concerns and how to recognize the types of vendors they'll encounter across the NoSQL ecosystem. This includes sniffing out open-core vendors that may advertise as "open source," but are driven by a business model that hinges on achieving proprietary lock-in.
-- The ability to determine if vendors offer open-code solutions that apply restrictive licensing, or if they support true open source technologies like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more that offer total portability and true freedom of use.
Data Con LA 2022 - Intro to Data Science
Zia Khan, Computer Systems Analyst and Data Scientist at LearningFuze
This Data Science tutorial is designed for people who are new to Data Science. This is a beginner-level session, so no prior coding or technical knowledge is required. Just bring your laptop with WiFi capability. The session starts with a review of what data science is, the amount of data we generate, and how companies are using that data to gain insight. We will pick a business use case and define the data science process, followed by a hands-on lab using Python and Jupyter Notebook. During the hands-on portion we will work with the pandas, numpy, matplotlib and sklearn modules and use a machine learning algorithm to approach the business use case.
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Mariana Danilovic, Managing Director at Infiom, LLC
We will address:
(1) Community creation and engagement using tokens and NFTs
(2) Organization of DAO structures and ways to incentivize Web3 communities
(3) DeFi business models applied to Web3 ventures
(4) Why Metaverse matters for new entertainment and community engagement models.
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Curtis ODell, Global Director Data Integrity at Tricentis
Join me to learn about a new end-to-end data testing approach designed for modern data pipelines that fills dangerous gaps left by traditional data management tools—one designed to handle structured and unstructured data from any source. You'll hear how you can use unique automation technology to reach up to 90 percent test coverage rates and deliver trustworthy analytical and operational data at scale. Several real world use cases from major banks/finance, insurance, health analytics, and Snowflake examples will be presented.
Key Learning Objectives
1. Data journeys are complex, and you have to ensure the integrity of the data end to end across this journey, from source to end reporting, for compliance.
2. Data management tools do not test data; they profile and monitor at best, and leave serious gaps in your data testing coverage.
3. Automation integrated with DevOps and DataOps CI/CD processes is key to solving this.
4. How this approach has an impact in your vertical.
Data Con LA 2022 - Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Arif Ansari, Professor at University of Southern California
A Super Bowl ad costs $7 million, and each year a few Super Bowl ads go viral. Traditional A/B testing does not predict virality. Some highly shared ads reach over 60 million organic views, which can be more valuable than views on TV. Not only are these views voluntary, but they are typically without distraction, and they win viewer engagement in the form of likes, comments, or shares. A Super Bowl ad that wins 69 million views on YouTube (e.g., Alexa Mind Reader) costs about 10 cents per quality view ($7 million / 69 million views)! However, the challenge is triggering virality. We developed a method to predict virality and engineer virality into ads.
1. Prof. Gerard J. Tellis and co-authors recommended that advertisers use YouTube to tease, test, and tweak (TTT) their ads to maximize sharing and viewing. 2022 saw that maxim put into practice.
2. We developed viral Ads prediction using two scientific models:
a. Prof. Gerard Tellis et al.'s model for viral prediction
b. Deep Learning viral prediction using social media effect
3. The model was able to identify all of the top 15 viral ads, and it performed better than the traditional agencies.
4. The newly proposed method is Tease, Test, Tweak, Target and Spots Ad.
Data Con LA 2022 - Embedding medical journeys with machine learning to improve...
Jai Bansal, Senior Manager, Data Science at Aetna
This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.
To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.
We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.
This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.
We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.
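The idea of treating each claim code as a token and representing it by its context can be sketched as below. This is a deliberately simplified stand-in, not the embedding algorithms the talk tested: it builds sparse co-occurrence count vectors rather than learned dense embeddings, and the code names (`DX_A`, `PROC_1`, `RX_X`, and so on), the window size, and both helper functions are made-up assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(sequences, window=2):
    """Map each code to counts of the codes seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for seq in sequences:
        for i, code in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[code][seq[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Made-up placeholder codes; real claims would carry diagnosis, procedure, and drug codes.
journeys = [
    ["DX_A", "PROC_1", "RX_X"],
    ["DX_B", "PROC_1", "RX_X"],
    ["DX_C", "PROC_2", "RX_Y"],
]
vecs = cooccurrence_vectors(journeys)
similar = cosine(vecs["DX_A"], vecs["DX_B"])     # codes that share a context
dissimilar = cosine(vecs["DX_A"], vecs["DX_C"])  # codes with no shared context
```

Codes that appear in similar journeys end up with similar vectors, which is the property the abstract relies on. A trained embedding model would additionally compress these sparse counts into a lower-dimensional dense vector per code.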
Data Con LA 2022 - Data Streaming with Kafka
Jie Chen, Manager Advisory, KPMG
Data is the new oil. However, many organizations have fragmented data in siloed lines of business. In this talk, we will focus on identifying the legacy patterns and their limitations and introducing the new patterns backed by Kafka's core design ideas. The goal is to tirelessly pursue better solutions that help organizations overcome bottlenecks in their data pipelines and modernize their digital assets so they are ready to scale their businesses. In summary, we will walk through three use cases, recommend dos and don'ts, and share takeaways for data engineers, data scientists, and data architects developing cutting-edge data-oriented skills.
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
Xuanzi Han, Senior Software Engineer at Monte Carlo
For modern data teams, lineage is a critical component of the data pipeline root cause and impact analysis workflow, as well as a means of ensuring that data, models, and other data assets are healthy and reliable. That being said, the complexity of SQL queries can make it challenging to build lineage manually, particularly at the field level. Xuanzi Han, a member of Monte Carlo's data and product teams, tackled this challenge head-on by leveraging some of the most popular tools in the modern data stack, including dbt, Airflow, Snowflake, and ANother Tool for Language Recognition (ANTLR). In this talk, they share how they designed the data model, query parser, and larger database design for field-level lineage, highlighting learnings, wrong turns, and best practices developed along the way.
Data Con LA 2022 - Finding true purpose after falling to addiction, and inspi...
David Sarabia, Founder/ CEO at inRecovery & Sig Narvaez, Executive Solution Architect at MongoDB
As a bullied kid, I found refuge in computers and taught myself to code at 8. By 26, I had two successful tech exits and moved to NYC. A weekend party habit led to daily drug use and a spiral to heroin and homelessness. In 2016, a friend's overdose woke me up. I checked myself into rehab and quickly realized I was there for a bigger purpose.
Healthcare is very broken, from legacy systems to inefficiencies to poor customer experience. What if we could dramatically improve care models by leveraging data, personalizing treatment, and creating beautiful patient experiences?
Ever worked in an industry that felt antiquated? Learn how we use MongoDB to transform addiction care and help people thrive in life!
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
Frank Bell, Data Thought Leader and Snowflake SME at Accenture - CEO at ITS
We will cover all aspects of optimizing your Snowflake Data Cloud including:
*Dive deep into how Snowflake pay-as-you-go costs work and how, by utilizing our proven optimization tools (Snoptimizer SaaS Snowflake Optimizer, https://snoptimizer.com/), scripts, and architecture techniques, you can typically save 10-40% or more on your existing Snowflake Account costs.
*Explain how Snowflake Compute works and proven techniques for architecting warehouses for both cost and performance efficiency. We cover in depth how Snowflake scales BOTH out and in as well as up and down with compute resources.
*Explain how Snowflake data storage works with Replication, Time-Travel, and Cloning. We explain these awesome features as well as their downsides if they are used and configured wrongly.
*Cover Snowflake cloud services costs and features that have costs related to them, including Snowpipe, Search Optimization, Materialized Views, Auto-clustering, and other recent new cost based features that provide value at a cost.
*Finally, we will discuss how you can ensure your Snowflake Account(s) are fully optimized not just for cost but also for security and performance on Snowflake. We will show you security and performance best practices as well as pitfalls to avoid.
Data Con LA 2022 - The Evolution of AI in Cybersecurity
Michael Melore, Senior Cybersecurity Advisor, IBM
The session will include views from the panel (and myself). We will:
* Review the current challenges: volumes of events, staffing shortages, expertise deficiencies, and siloed security controls.
* Provide statistics from recent Ponemon Institute reports, including the Cost of a Data Breach 2021 Report's findings on attack vectors, response and organizational impact, and costs attributed to remote workforces.
* Present the impact of AI and machine learning on cost and response times.
* Share the ways AI is used in law enforcement and critical infrastructure protection.
* Discuss AI bias and evolving trust and validation requirements in AI systems, the necessity and value of AI insight to security, and where the industry is moving in AI for security.
Data Con LA 2022 - Who Owns That Yacht? How Graphs Are Used to Identify Asset...
Mark Quinsland, Sr. Field Engineer at Neo4j
Luxury yachts, football teams, and mansions are no longer safe havens for the illicit profits of Russian Oligarchs with ties to Putin. Assets are being identified and seized with benefits flowing to causes in Ukraine. This presentation covers:
- How are friends and relatives of Putin sheltering immense profits
- Graphs and other tools being used to identify sources & destinations of illicit wealth
- Latest asset seizures
- New regulations to expose hidden investors
Data Con LA 2022 - Event Sourcing with Apache Pulsar and Apache Quarkus
David Kjerrumgaard, Developer Advocate, StreamNative
I believe that event-sourcing is the best way to implement persistence within a microservices architecture, but it hasn't always been the easiest solution to implement. In this talk, I will demonstrate how these two exciting technologies can be combined into one killer stack that simplifies event sourcing development. I will outline how to use DDD and CQRS concepts as a guide for developing an event sourcing food-delivery application based on Apache Pulsar and Quarkus that is 100% cloud native. Throughout this talk, I will demonstrate several different event sourcing design patterns across multiple microservices to feed multiple real-time dashboards that provide driver location tracking, and heatmaps. I will also highlight some patterns for using an event streaming platform as your event store.
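The core event-sourcing idea in this abstract, persisting an append-only sequence of events and deriving current state by replaying them, can be sketched in a few lines. This is a minimal in-memory illustration only: it does not use Apache Pulsar or Quarkus, and the event names, the `EventStore` class, and the `replay` function are hypothetical choices for a food-delivery order.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    order_id: str
    kind: str  # hypothetical kinds: "OrderPlaced", "DriverAssigned", "OrderDelivered"
    data: dict = field(default_factory=dict)

class EventStore:
    """In-memory, append-only log standing in for a durable event stream."""

    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)

    def events_for(self, order_id):
        return [e for e in self._log if e.order_id == order_id]

def replay(events):
    """Rebuild an order's current state by folding over its events in order."""
    state = {"status": "unknown", "driver": None}
    for e in events:
        if e.kind == "OrderPlaced":
            state["status"] = "placed"
        elif e.kind == "DriverAssigned":
            state["status"] = "en_route"
            state["driver"] = e.data["driver"]
        elif e.kind == "OrderDelivered":
            state["status"] = "delivered"
    return state

store = EventStore()
store.append(Event("order-1", "OrderPlaced"))
store.append(Event("order-1", "DriverAssigned", {"driver": "driver-7"}))
store.append(Event("order-2", "OrderPlaced"))
store.append(Event("order-1", "OrderDelivered"))

current = replay(store.events_for("order-1"))
```

Because state is never updated in place, separate read models (for example a dashboard, in the CQRS sense) can each be built by replaying the same log with their own fold function.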
Data Con LA 2022 - Customer-Driven Data Engineering
Emad Georgy, CTO, Georgy Technology Leadership
- Getting customers engaged and excited about data architecture plans
- How to integrate UX practices into Data Engineering
- Data Governance is bullshit - why?
- Applying performance, scale and usability tests to your Data Engineering journey
Data Con LA 2022 - Early cancer detection using higher-order genome architecture
My (Angela) Chung, Data Enthusiast, San Jose State University
Cancer is a complex disease which requires interactions between cell-intrinsic alterations and tumor microenvironment. The connection between epigenetics and genomic structure plays a key role in chromatin interactions and enhancer-promoter communications for transcriptional activities. Alterations of these components in oncogenic signaling pathway potentially cause cancer cell-intrinsic changes and inappropriate instructions to normal cell cycles, leading to abnormal cell growth.
- Topologically associating domains (TADs) and A/B compartments are the main structures of higher-order chromatin structure. These contact domains, chromatin states, super-enhancers, and histone modifications together regulate transcription and gene expression for normal/abnormal cell cycles.
- Several bioinformatics tools were utilized: FANC for processing raw FASTQ data to Hi-C contact matrices, JuicerTools for obtaining the locations of contact domains on the entire genome, and CoolBox for visualizing chromatin contacts in different cell lines.
- High-resolution chromatin contacts showed dynamic interactions among chromosomal regions in different cell lines.
- Qualitative and quantitative features were comprehensively engineered from 3D chromatin folding and epigenetic regulators using available packages (scikit-learn, PyTorch, pandas, numpy, matplotlib, etc.).
- XGBoost multi-class classifier achieved the highest accuracy of 80.90% in classifying normal and cancer cell lines based on chromatin interactions, followed by Random Forest at 73.76% and TabNet classifier at 70.00%.
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what�s new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Data Con LA 2022 - Real world consumer segmentationData Con LA
Jaysen Gillespie, Head of Analytics and Data Science at RTB House
1. Shopkick has over 30M downloads, but the userbase is very heterogeneous. Anecdotal evidence indicated a wide variety of users for whom the app holds long-term appeal.
2. Marketing and other teams challenged Analytics to get beyond basic summary statistics and develop a holistic segmentation of the userbase.
3. Shopkick's data science team used SQL and python to gather data, clean data, and then perform a data-driven segmentation using a k-means algorithm.
4. Interpreting the results is more work -- and more fun -- than running the algo itself. We'll discuss how we transform from ""segment 1"", ""segment 2"", etc. to something that non-analytics users (Marketing, Operations, etc.) could actually benefit from.
5. So what? How did team across Shopkick change their approach given what Analytics had discovered.
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
Ravi Pillala, Chief Data Architect & Distinguished Engineer at Intuit
TurboTax is one of the well known consumer software brand which at its peak serves 385K+ concurrent users. In this session, We start with looking at how user behavioral data & tax domain events are captured in real time using the event bus and analyzed to drive real time personalization with various TurboTax data pipelines. We will also look at solutions performing analytics which make use of these events, with the help of Kafka, Apache Flink, Apache Beam, Spark, Amazon S3, Amazon EMR, Redshift, Athena and Amazon lambda functions. Finally, we look at how SageMaker is used to create the TurboTax model to predict if a customer is at risk or needs help.
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
George Mansoor, Chief Information Systems Officer at California State University
Overview of the CSU Data Architecture on moving on-prem ERP data to the AWS Cloud at scale using Delphix for Data Replication/Virtualization and AWS Data Migration Service (DMS) for data extracts
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
Anand Ranganathan, Chief AI Officer at Unscrambl
Conversational AI is getting more and more widely used for customer support and employee support use-cases. In this session, I'm going to talk about how it can be extended for data analysis and data science use-cases ... i.e., how users can interact with a bot to ask analytical questions on data in relational databases.
This allows users to explore complex datasets using a combination of text and voice questions, in natural language, and then get back results in a combination of natural language and visualizations. Furthermore, it allows collaborative exploration of data by a group of users in a channel in platforms like Microsoft Teams, Slack or Google Chat.
For example, a group of users in a channel can ask questions to a bot in plain English like ""How many cases of Covid were there in the last 2 months by state and gender"" or ""Why did the number of deaths from Covid increase in May 2022"", and jointly look at the results that come back. This facilitates data awareness, data-driven collaboration and joint decision making among teams in enterprises and outside.
In this talk, I'll describe how we can bring together various features including natural-language understanding, NL-to-SQL translation, dialog management, data story-telling, semantic modeling of data and augmented analytics to facilitate collaborate exploration of data using conversational AI.
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
Anil Inamdar, VP & Head of Data Solutions at Instaclustr
The most modernized enterprises utilize polyglot architecture, applying the best-suited database technologies to each of their organization's particular use cases. To successfully implement such an architecture, though, you need a thorough knowledge of the expansive NoSQL data technologies now available.
Attendees of this Data Con LA presentation will come away with:
-- A solid understanding of the decision-making process that should go into vetting NoSQL technologies and how to plan out their data modernization initiatives and migrations.
-- They will learn the types of functionality that best match the strengths of NoSQL key-value stores, graph databases, columnar databases, document-type databases, time-series databases, and more.
-- Attendees will also understand how to navigate database technology licensing concerns, and to recognize the types of vendors they'll encounter across the NoSQL ecosystem. This includes sniffing out open-core vendors that may advertise as “open source,"" but are driven by a business model that hinges on achieving proprietary lock-in.
-- Attendees will also learn to determine if vendors offer open-code solutions that apply restrictive licensing, or if they support true open source technologies like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more that offer total portability and true freedom of use.
Data Con LA 2022 - Intro to Data ScienceData Con LA
Zia Khan, Computer Systems Analyst and Data Scientist at LearningFuze
This Data Science tutorial is designed for people who are new to Data Science. It is a beginner-level session, so no prior coding or technical knowledge is required. Just bring your laptop with WiFi capability. The session starts with a review of what data science is, the amount of data we generate, and how companies are using that data to get insight. We will pick a business use case and define the data science process, followed by a hands-on lab using Python and Jupyter Notebook. During the hands-on portion we will work with the pandas, numpy, matplotlib and sklearn modules and use a machine learning algorithm to approach the business use case.
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
Mariana Danilovic, Managing Director at Infiom, LLC
We will address:
(1) Community creation and engagement using tokens and NFTs
(2) Organization of DAO structures and ways to incentivize Web3 communities
(3) DeFi business models applied to Web3 ventures
(4) Why Metaverse matters for new entertainment and community engagement models.
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
Curtis ODell, Global Director Data Integrity at Tricentis
Join me to learn about a new end-to-end data testing approach designed for modern data pipelines that fills dangerous gaps left by traditional data management tools—one designed to handle structured and unstructured data from any source. You'll hear how you can use unique automation technology to reach up to 90 percent test coverage rates and deliver trustworthy analytical and operational data at scale. Several real world use cases from major banks/finance, insurance, health analytics, and Snowflake examples will be presented.
Key Learning Objectives
1. Data journeys are complex, and you have to ensure the integrity of the data end to end across this journey, from source to end reporting, for compliance
2. Data management tools do not test data; they profile and monitor at best, and leave serious gaps in your data testing coverage
3. Automation with integration into DevOps and DataOps CI/CD processes is key to solving this.
4. How this approach has impact in your vertical
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
Arif Ansari, Professor at University of Southern California
A Super Bowl ad costs $7 million, and each year a few Super Bowl ads go viral. Traditional A/B testing does not predict virality. Some highly shared ads reach over 60 million organic views, which can be more valuable than views on TV. Not only are these views voluntary, but they are typically without distraction, and they win viewer engagement in the form of likes, comments, or shares. A Super Bowl ad that wins 69 million views on YouTube (e.g., Alexa Mind Reader) costs roughly 10 cents per quality view! However, the challenge is triggering virality. We developed a method to predict virality and engineer virality into ads.
1. Prof. Gerard J. Tellis and co-authors recommended that advertisers use YouTube to tease, test, and tweak (TTT) their ads to maximize sharing and viewing. 2022 saw that maxim put into practice.
2. We developed viral Ads prediction using two scientific models:
a. Prof. Gerard Tellis et al.'s model for viral prediction
b. Deep Learning viral prediction using social media effect
3. The model was able to identify all of the top 15 viral ads, and it performed better than the traditional agencies.
4. The newly proposed method is Tease, Test, Tweak, Target and Spots Ad.
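The cost-per-view figure quoted in the abstract can be checked with a short calculation. This is only a sketch: it assumes the $7 million figure is the full cost attributed to the ad and that all 69 million YouTube views count as quality views. The function name is illustrative.

```python
# Figures quoted in the abstract. Treating $7 million as the full ad cost
# is an assumption made for this illustration.
AD_COST_USD = 7_000_000
ORGANIC_VIEWS = 69_000_000


def cost_per_view_cents(cost_usd: float, views: int) -> float:
    """Return the cost of one view in US cents."""
    return 100 * cost_usd / views


cents = cost_per_view_cents(AD_COST_USD, ORGANIC_VIEWS)
print(f"{cents:.1f} cents per view")  # prints "10.1 cents per view"
```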
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
Jai Bansal, Senior Manager, Data Science at Aetna
This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.
To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.
We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.
This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.
We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.
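The idea of treating claim codes as tokens can be illustrated with a small sketch. The abstract does not say which embedding algorithms were tested, so this example swaps in a simple stand-in: a code co-occurrence matrix factorised with truncated SVD. The code names and sequences are made up for illustration and are not real claims data.

```python
import numpy as np

# Hypothetical code sequences, one list per member. These are invented
# tokens, not real diagnosis (DX), procedure (PX), or drug (RX) codes.
sequences = [
    ["DX_A", "PX_1", "RX_7"],
    ["DX_A", "PX_1", "RX_8"],
    ["DX_B", "PX_2", "RX_9"],
    ["DX_B", "PX_2", "RX_9"],
]


def train_embeddings(sequences, dim=2, window=2):
    """Learn dense vectors from code co-occurrence counts via truncated SVD."""
    vocab = sorted({code for seq in sequences for code in seq})
    index = {code: i for i, code in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for seq in sequences:
        for i, code in enumerate(seq):
            start, stop = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[index[code], index[seq[j]]] += 1
    # Log-scale the counts, then keep the top `dim` singular directions.
    u, s, _ = np.linalg.svd(np.log1p(counts))
    vectors = u[:, :dim] * s[:dim]
    return {code: vectors[index[code]] for code in vocab}


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


emb = train_embeddings(sequences)
# Codes that appear in similar contexts should end up close together.
print(cosine(emb["RX_7"], emb["RX_8"]), cosine(emb["RX_7"], emb["RX_9"]))
```

In this toy data, RX_7 and RX_8 share the same neighbouring codes, so their vectors should be far more similar to each other than to RX_9. The resulting vectors could then be fed to a downstream model as dense numeric features.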
Data Con LA 2022 - Data Streaming with KafkaData Con LA
Jie Chen, Manager Advisory, KPMG
Data is the new oil. However, many organizations have fragmented data in siloed lines of business. In this talk, we will focus on identifying the legacy patterns and their limitations and introducing the new patterns backed by Kafka's core design ideas. The goal is to tirelessly pursue better solutions for organizations to overcome bottlenecks in data pipelines and modernize their digital assets so they are ready to scale their businesses. In summary, we will walk through three use cases and recommend dos and don'ts and takeaways for Data Engineers, Data Scientists, and Data Architects in developing forefront data-oriented skills.
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...Data Con LA
Xuanzi Han, Senior Software Engineer at Monte Carlo
For modern data teams, lineage is a critical component of the data pipeline root cause and impact analysis workflow, as well as a means of ensuring that data, models, and other data assets are healthy and reliable. That being said, the complexity of SQL queries can make it challenging to build lineage manually, particularly at the field level. Xuanzi Han, a member of Monte Carlo's data and product teams, tackled this challenge head-on by leveraging some of the most popular tools in the modern data stack, including dbt, Airflow, Snowflake, and ANother Tool for Language Recognition (ANTLR). In this talk, they share how they designed the data model, query parser, and larger database design for field-level lineage, highlighting learnings, wrong turns, and best practices developed along the way.
Data Con LA 2022 - Finding true purpose after falling to addiction, and inspi...Data Con LA
David Sarabia, Founder/ CEO at inRecovery & Sig Narvaez, Executive Solution Architect at MongoDB
As a bullied kid, I found refuge in computers and taught myself to code at 8. By 26, I had two successful tech exits and moved to NYC. A weekend party habit led to daily drug use and a spiral to heroin and homelessness. In 2016, a friend's overdose woke me up. I checked myself into rehab and quickly realized I was there for a bigger purpose.
Healthcare is very broken, from legacy systems and inefficiencies to poor customer experience. What if we could dramatically improve care models by leveraging data, personalizing treatment, and creating beautiful patient experiences?
Ever worked in an industry that felt antiquated? Learn how we use MongoDB to transform addiction care and help people thrive in life!
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...Data Con LA
Frank Bell, Data Thought Leader and Snowflake SME at Accenture - CEO at ITS
We will cover all aspects of optimizing your Snowflake Data Cloud including:
*Dive deep into how Snowflake pay-as-you-go costs work and how, by utilizing our proven optimization tools (Snoptimizer SaaS Snowflake Optimizer, https://snoptimizer.com/), scripts, and architecture techniques, you can typically save 10-40+% on your existing Snowflake Account costs.
*Explain how Snowflake Compute works and proven techniques for architecting warehouses for both cost and performance efficiency. We cover in depth how Snowflake scales BOTH out and in as well as up and down with compute resources.
*Explain how Snowflake data storage works with Replication, Time-Travel, and Cloning. We explain these awesome features as well as their downsides if they are used and configured wrongly.
*Cover Snowflake cloud services costs and features that have costs related to them, including Snowpipe, Search Optimization, Materialized Views, Auto-clustering, and other recent new cost based features that provide value at a cost.
*Finally, we will discuss how you can ensure your Snowflake Account(s) are fully optimized not just for cost but also for security and performance on Snowflake. We will show you security and performance best practices as well as pitfalls to avoid.
Data Con LA 2022 - The Evolution of AI in CybersecurityData Con LA
Michael Melore, Senior Cybersecurity Advisor, IBM
The session will include views from the panel (and myself):
* Review the current challenges: volumes of events, staffing shortages, expertise deficiencies, and siloed security controls
* Provide statistics from recent Ponemon Institute reports, including the Cost of a Data Breach 2021 Report's findings on attack vectors, response/organizational impact, and costs attributed to remote workforces
* Provide the impact of AI/Machine Learning, etc. on cost and response times
* Share the ways AI is used in law enforcement and critical infrastructure protection
* Discuss AI bias and evolving Trust and Validation requirements in AI systems, the necessity and value of AI insight to security, and where the industry is moving in AI for security
Data Con LA 2022 - Who Owns That Yacht? How Graphs Are Used to Identify Asset...Data Con LA
Mark Quinsland, Sr. Field Engineer at Neo4j
Luxury yachts, football teams, and mansions are no longer safe havens for the illicit profits of Russian Oligarchs with ties to Putin. Assets are being identified and seized with benefits flowing to causes in Ukraine. This presentation covers:
- How are friends and relatives of Putin sheltering immense profits
- Graphs and other tools being used to identify sources & destinations of illicit wealth
- Latest asset seizures
- New regulations to expose hidden investors
Data Con LA 2022 - Event Sourcing with Apache Pulsar and Apache QuarkusData Con LA
David Kjerrumgaard, Developer Advocate, StreamNative
I believe that event-sourcing is the best way to implement persistence within a microservices architecture, but it hasn't always been the easiest solution to implement. In this talk, I will demonstrate how these two exciting technologies can be combined into one killer stack that simplifies event sourcing development. I will outline how to use DDD and CQRS concepts as a guide for developing an event sourcing food-delivery application based on Apache Pulsar and Quarkus that is 100% cloud native. Throughout this talk, I will demonstrate several different event sourcing design patterns across multiple microservices to feed multiple real-time dashboards that provide driver location tracking, and heatmaps. I will also highlight some patterns for using an event streaming platform as your event store.
Data Con LA 2022 - Customer-Driven Data EngineeringData Con LA
Emad Georgy, CTO, Georgy Technology Leadership
- Getting customers engaged and excited about data architecture plans
- How to integrate UX practices into Data Engineering
- Data Governance is bullshit - why?
- Applying performance, scale and usability tests to your Data Engineering journey
Data Con LA 2022 - Early cancer detection using higher-order genome architectureData Con LA
My (Angela) Chung, Data Enthusiast, San Jose State University
Cancer is a complex disease which requires interactions between cell-intrinsic alterations and tumor microenvironment. The connection between epigenetics and genomic structure plays a key role in chromatin interactions and enhancer-promoter communications for transcriptional activities. Alterations of these components in oncogenic signaling pathway potentially cause cancer cell-intrinsic changes and inappropriate instructions to normal cell cycles, leading to abnormal cell growth.
- Topologically associating domains (TADs) and A/B compartments are the main structures of higher-order chromatin structure. These contact domains, chromatin states, super-enhancers, and histone modifications together regulate transcription and gene expression for normal/abnormal cell cycles.
- Several bioinformatics tools were utilized: FANC for processing raw FASTQ data into Hi-C contact matrices, JuicerTools for obtaining the locations of contact domains on the entire genome, and CoolBox for visualizing chromatin contacts in different cell lines.
- High-resolution chromatin contacts showed dynamic interactions among chromosomal regions in different cell lines.
- Qualitative and quantitative features were comprehensively engineered from 3D chromatin folding and epigenetic regulators using available packages (scikit-learn, PyTorch, pandas, NumPy, matplotlib, etc.).
- An XGBoost multi-class classifier achieved the highest accuracy, 80.90%, in classifying normal and cancer cell lines based on chromatin interactions, followed by Random Forest at 73.76% and a TabNet classifier at 70.00%.
Data Con LA 2022 - Human Capital Growth Analytics
1. Human Capital Growth Analytics
DataconLA | August 2022
Citi Ventures| Venture Innovation
2. Intellectual/ Knowledge Capital
Intangible assets of an enterprise that are required to achieve business goals, including employees' knowledge; data and information about processes, products, customers and competitors; and intellectual property such as patents or regulatory licenses. (Source: Gartner)
Jul 14, 2022
3. Intellectual/ Knowledge Capital
Intellectual Capital: Human capital, Relational capital, Structural capital
Human capital
• The contributions made to an organization by its employees utilizing their talents, skills, and expertise.
• Possessed only by individuals but may be harnessed by an organization.
• Provides a comparative advantage
• Quality organizations are ones that focus on retaining creative and innovative workers, as well as work toward creating a setting where such intelligence can be taught and learned.
Relational capital
• The relationships between coworkers as well as those between workers and suppliers, customers, partners, and collaborators.
• Relationship capital also includes franchises, licenses, and trademarks as they have value only in the context of the relationship they have with customers.
Structural capital
• The non-physical capital possessed by an organization, such as processes, methods, and techniques, that allows it to operate and enables it to leverage its capabilities.
• Structural capital may include intellectual property such as databases, code, patents, proprietary processes, trademarks, software, and more.
5. Assumptions
• Data Accuracy
• No Spillover Effects for R&D
• Data Missing at Random
• Constant Workplace Productivity
(Image sources: Data entry best practices, Aimstyle, Towards Data Science, Getty Images)
6. Data Science Workflow
Data Preprocessing: Financial data, Employee data, Data Cleaning
Modeling: Classification using Random Forest
Recommendations: Human Capital (HC) Index [1], Customized HC investment [2]
1. The index is based on a probability that is calculated using the importance of each feature as well as SHAP values, which depict how that feature affects the output in the presence of other factors. In other words, it is a number between 0 and 100, calculated from the model output, which is the probability of a firm being a high growth firm (a firm with a growth rate > 2.5%). If our model predicts a probability of 0.5, then the Human Capital Index for that firm will be (100*0.5) = 50
2. Customized HC investments are based on SHapley Additive exPlanations (SHAP) values for those human capital factors. The goal of customized investments is to highlight critical human capital factors along with their optimum allocations, aimed at increasing the total revenue growth rate for a firm
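The index definition in note 1 above can be written out as a short sketch. The function and constant names are illustrative, not taken from the presentation. Only the 2.5% threshold and the 100 × probability scaling come from the slide.

```python
# Threshold from the slide: a high growth firm has a growth rate > 2.5%.
HIGH_GROWTH_THRESHOLD_PCT = 2.5


def is_high_growth(growth_rate_pct: float) -> bool:
    """Label used as the classification target."""
    return growth_rate_pct > HIGH_GROWTH_THRESHOLD_PCT


def human_capital_index(prob_high_growth: float) -> float:
    """Scale the model's predicted probability to a 0-100 index."""
    if not 0.0 <= prob_high_growth <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 100 * prob_high_growth


# Worked example from the slide: a predicted probability of 0.5 gives 50.
print(human_capital_index(0.5))
```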
7. Data Sources [1]
Employee data: Salary, Attrition, Gender, Race, Job Titles, Skills, etc.
Financial data: Revenue, Profitability, Return on Equity, Stock Price, etc.
Data: ~2150 US firms
Missing Data
– Majority of firms in the financial sector don't report R&D
– Around 16% of R&D data is filled in using various computing techniques
1. Employee and financial data are provided by Revelio Labs and the S&P financials database Capital IQ, respectively
8. Classification Using H2O AutoML
Input features:
• R&D Expense as % of Total Revenue
• Location
• Firm Size
• Sector
• Employees across top Skill Clusters
• Employees in Technical roles
Target: Is it a High Growth firm?
Top Skill Clusters:
• SAP/ Accounting/ Financial Reporting
• SQL/ Linux/ Development
• Marketing/ Social Media/ Marketing Strategy
• Data Analysis/ Databases/ Data Warehousing
• Java/ C++/ C
• Analysis/ Financial Analysis/ Finance
• Human Resources/ Recruiting/ Performance Management
• Program Management/ Security/ Engineering management
• Information technology/ people management/ project management
Technical roles: Project Manager, Database Administrator, Test Engineer, Data Analyst, Graphic Designer, Software Developer, IT Project Manager, Web Developer, Application Engineer, Infrastructure Engineer, Information Security, Quality Engineer, Scientist, Data Engineer, UX Designer, Systems Engineer, Software Engineer, Data Scientist, Business Analyst, Automation Engineer, Technology Lead, IT Analyst, DevOps Engineer, Product Manager, Technical Support Engineer, Technology Analyst, Technical Architect
10. Key factors and their link to firms' financial performance
Relationship between Human Capital Factors and firms' Financial Performance
Partial Dependence Plots (PDP) [2]
– Show how each human capital factor affects the average prediction about a firm's performance
– Employees with Data [1] skills show an initial increase [3] in the average output probability of a firm being a high growth firm
[Figure: partial dependence plots. Y-axis: Probability of firm being high growth. X-axis: % Workforce/Spend [4]]
Employees within Technical and Engineering Management skill clusters, along with R&D, increase the average probability of a firm being a high growth firm
1. Data Skills: Skills specific to Data Science/ Data Engineering roles; Software Development Skills: Skills specific to Software Engineering/ Developer roles
2. A Partial Dependence Plot (PDP) shows the marginal effect of a variable on the mean response by assuming independence between the feature for which the PDP is calculated and the rest
3. Follows the law of diminishing marginal utility: the output increases at a decreasing rate
4. X-axis description:
• For Data & Analytics, Software Development, Engineering Management skill clusters, X axis refers to % of employees in that skill cluster
• For R&D, it refers to R&D expense as % of Total Revenue
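The partial dependence calculation described in note 2 above can be sketched in a few lines: fix one feature at each grid value for every firm, then average the model's predictions. The model here is a hypothetical toy function with diminishing returns, standing in for the trained classifier. The firms, coefficients, and feature names are all invented for illustration.

```python
import math


def toy_model(firm):
    """Hypothetical stand-in for the trained classifier's probability output."""
    score = 0.03 * firm["pct_data_skills"] + 0.02 * firm["pct_rd"]
    return 1.0 - math.exp(-score)


# Made-up firms. Values are percentages of workforce or revenue.
firms = [
    {"pct_data_skills": 5, "pct_rd": 2},
    {"pct_data_skills": 15, "pct_rd": 10},
    {"pct_data_skills": 28, "pct_rd": 42},
]


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


grid = [0, 10, 20, 30, 40]
curve = partial_dependence(toy_model, firms, "pct_data_skills", grid)
for value, prob in zip(grid, curve):
    print(f"{value:>3}% data skills -> average probability {prob:.3f}")
```

With this toy model the curve rises at a decreasing rate, which is the shape note 3 describes. A real analysis would pass the fitted model's prediction function instead of `toy_model`.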
11. Human Capital Growth Index – Sample Output
The following table ranks a subset of companies in the US based on their HC index.
• The top-rated firm from the predicted index belongs to the IT sector and is located in California.
• The lowest-rated firm belongs to the Industrials sector and is in Florida.
• Overall, the index values are specifically high for IT and Health care firms with high R&D and Tech workforce.
Columns: Company | % R&D | % Tech roles | % Employees in Software Development skill cluster | % Employees in Data Analytics skill cluster | % Employees in Engineering Management skill cluster | Sector | State | HC Index
(Rows run from High Growth at the top to Low Growth at the bottom.)
A | 42 | 50 | 87 | 28 | 15 | Information Technology | California | 100
B | 20 | 48 | 67 | 18 | 10 | Information Technology | California | 97
C | 20 | 48 | 65 | 15 | 15 | Information Technology | Massachusetts | 97
D | 21 | 49 | 54 | 15 | 15 | Information Technology | New York | 91
E | 19 | 43 | 9 | 14 | 5 | Health care | California | 91
F | 11 | 32 | 9 | 12 | 5 | Health care | Massachusetts | 66
G | 11 | 17 | 9 | 4 | 3 | Communication Services | California | 54
H | 1 | 12 | 2 | 3 | 2 | Consumer Staples | Michigan | 22
I | 3 | 19 | 4 | 4 | 4 | Consumer Discretionary | Michigan | 20
J | 1 | 16 | 5 | 6 | 5 | Financials | Massachusetts | 18
K | 2 | 15 | 5 | 3 | 3 | Consumer Discretionary | Texas | 13
L | 2 | 20 | 8 | 7 | 5 | Financials | New York | 9
M | 1 | 8 | 0 | 1 | 3 | Industrials | Florida | 6
12. Moderate correlation between Human Capital Index and Revenue Growth Rate
[Figure: Human Capital Index vs. Total Revenue Growth Rate]
Correlation: 54%*
*Shown in percentage (0.54*100).
Ranges for Correlation Coefficient:
0.9 - 1 (Very high); 0.7 - 0.9 (High); 0.5 - 0.7 (Moderate); 0.3 - 0.5 (Low); 0 - 0.3 (Very Low)
Source: https://towardsdatascience.com/eveything-you-need-to-know-about-interpreting-correlations-2c485841c0b8
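The correlation bands above can be applied programmatically. This sketch assumes the coefficient is a Pearson correlation, which the slide does not state, and it assigns each boundary value to the higher band, which is also an assumption. Function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strength(r):
    """Map a coefficient to the bands quoted on the slide."""
    magnitude = abs(r)
    if magnitude >= 0.9:
        return "Very high"
    if magnitude >= 0.7:
        return "High"
    if magnitude >= 0.5:
        return "Moderate"
    if magnitude >= 0.3:
        return "Low"
    return "Very Low"


# The slide's reported value of 0.54 falls in the "Moderate" band.
print(strength(0.54))
```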
13. Case study: BottomUp Solutions Vs. Apexify Labs
Compared to Apexify Labs, BottomUp Solutions continues to underutilize its technical workforce and R&D spend
Total Revenue Growth Rate (CY 2020)
– BottomUp Solutions: -5.5%
– Apexify Labs: 12%
[Figure: Contribution of Human Capital Factors towards Revenue Growth Rate, BottomUp Solutions Vs. Apexify Labs. Factors shown: % IT/ People/ Project Management; % Technical workforce; % HR/ Performance Management; % Program/ Engineering management; % R&D; % Java/ C/ C++; % SQL/ Linux/ Software Development; % Data Analysis/ Data Engineering]
– For BottomUp Solutions, the contributions [1] of all selected human capital factors are consistently lower than those of Apexify Labs
– The difference is significant [2] for R&D Expense and % employees across Technical roles, Software Engineering, Data, and Engineering Management skills clusters.
– As per our hypothesis, there is scope to improve BottomUp Solutions's revenue growth rate by increasing the contribution of some of these human capital factors
Note: Apexify Labs and BottomUp Solutions used in this presentation represent two fictional companies in the IT industry
1. Contributions can also be interpreted as weights which are used in calculating the output probability of a firm being a high growth firm
2. Assumption: Significant where contribution for BottomUp < 0.5*(Apexify)
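The significance rule in note 2 above is a simple comparison and can be sketched as follows. The contribution values are hypothetical placeholders, since the slide's chart values are not recoverable from the transcript. The function name is illustrative.

```python
def significant_gaps(bottomup, apexify, ratio=0.5):
    """Return factors where BottomUp's contribution is < ratio * Apexify's.

    The rule is applied exactly as stated on the slide. It is only
    meaningful for positive contributions.
    """
    return [
        factor
        for factor in apexify
        if bottomup[factor] < ratio * apexify[factor]
    ]


# Hypothetical contribution values, invented for illustration only.
apexify = {
    "% R&D": 0.10,
    "% Technical workforce": 0.08,
    "% HR/ Performance Management": 0.02,
}
bottomup = {
    "% R&D": 0.03,
    "% Technical workforce": 0.03,
    "% HR/ Performance Management": 0.015,
}

print(significant_gaps(bottomup, apexify))
```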
Apexify Labs does an excellent job in creating value from its technical workforce
[Figure: Impact on financial performance of % Data Analytics, % Software Development, and % Engineering Management for Apexify Labs and BottomUp Solutions]
15. Understanding Human Capital Analytics can help the organization to proactively plan its Human Capital decisions, which can lead to positive financial performance
The interactive app takes new allocations for BottomUp Solutions's key human capital metrics as input and predicts whether this firm would have been a high growth firm with these new allocations.
App Screenshot
Human Capital Factors: Tech workforce, Software Development, Java, R&D
Existing Allocations (%): 36, 20, 8, 8
New Allocations (%): 35, 20, 4, 10
Run
Total Revenue Growth Rate with existing allocations is –5.5%
With new allocations BottomUp Solutions could have been a LOW growth firm, i.e. its predicted Total Revenue Growth Rate is < 2.5%
Note: The above app takes the following inputs for 2017-2019: % Technical workforce, % employees across SQL/ Linux/ Software Development, % employees across Java/ C/ C++, and R&D Expense as % of Total Revenue for 2017, and predicts whether the firm would have been a high growth firm in 2020.
16. Dashboard view of metrics and scenario planning decisions
With a few changes BottomUp Solutions could have increased its growth rate by 8%
Recommendation
One of the ways BottomUp Solutions might have increased its revenue growth rate from –5.5% to at least 2.5% is by making the following changes to its knowledge-based capital:
– Increasing % employees with SQL/ Linux/ Software Development and Java/ C/ C++ skills by 4%
– Increasing % R&D spend by 5%
*Currently focusing on only certain aspects of boosting productivity: hiring additional employees with technical skills and increasing R&D spend
[Dashboard metrics: Female Board Members, Technical Skills, Software Development, Java, Research & Development. Value pairs shown: 36% 45%; 36% 40%; 8% 12%; 8% 13%]
Note: The above combination shows only selected actionable human capital factors whose allocations are obtained by trial and error. Please note that it represents one of all the possible combinations that could have classified the firm as a high growth firm.
18. Next Steps
Enhance the Human Capital Index
– Explore better performing models that use alternative financial metrics such as EBITDA, total revenue, etc.
– Create new features to better explain our data
– Improve the interpretability of the continuous Human Capital Index by creating distinct subcategories
– Incorporate new data, historical data, global data, data on startups, D&I
– Look into Causality
Overall, we can improve our human capital index and make it more robust by increasing the number of firms and adding new features to our model