Big Data Camp LA 2014, Don't re-invent the Big-Data Wheel, Building real-time, Big Data applications on Cassandra with the open-source Kiji project by Clint Kelly of Wibidata
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre... - Data Con LA
The team at Fandango heartily embraced NoSQL, using Couchbase to power a key media publishing system. The initial implementation was fraught with integration issues and high latency, and required a major effort to successfully refactor. My talk will outline the key organizational and architectural decisions that created deep systemic problems, and the steps taken to re-architect the system to achieve a high level of performance at scale.
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio... - Data Con LA
Apache Solr makes it easy to visualize and explore your data interactively. Create a dashboard, add some facets, select some values, cross them with time… and just look at the results.
Apache Spark is a growing framework for performing streaming computations, which makes it well suited to real-time indexing.
Solr also comes with new Analytics Facets, a major addition to the data explorer's arsenal. They bring another dimension: calculations. We can now do the equivalent of SQL, just in a much simpler and faster way. These calculations can operate over buckets of data. For example, it is now possible to see the sum of Web traffic by country over time, the median price of some categories of products, which ads bring in more money by location...
This talk puts into practice some of the leading features of Solr Search. It presents the main types of facets/stats and the advanced properties and usage that make them shine. A demo in parallel with the open source Search App in Hue will demonstrate how these facets can power interactive widgets or your own analytic queries. The data will be indexed in real time from a live stream with Spark.
Hadoop and NoSQL joining forces by Dale Kim of MapR - Data Con LA
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ... - Data Con LA
During my time working on attribution and ingest systems, I've encountered several different approaches to answering the simple question: "How do I get data from A to B?" In this session, I'd like to share some of the problems I've encountered and how to solve them effectively.
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ... - Data Con LA
NoSQL has exploded onto the developer scene, promising alternatives to RDBMS that make rapidly developing Internet-scale applications easier than ever. However, as a trade-off for the ease of development and scale, some of the familiarity of well-known query interfaces such as SQL has been lost. Until now, that is... N1QL (pronounced "nickel") is a SQL-like query language for querying JSON, which brings the familiarity of RDBMS back to the NoSQL world. In this session you will learn about the syntax and basics of this new language as well as integration with the Couchbase SDKs.
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S... - Data Con LA
Investigated a couple of audio-based deep learning strategies for identifying human vocalized car sounds. In one case, Mel Frequency Cepstral Coefficients (MFCCs) were used as inputs into a supervised logistic regression neural network. In a separate case, Short-Time Fourier Transforms (STFTs) were used to generate PCA-whitened spectrograms, which were used as inputs into a supervised convolutional neural network. The MFCC method trained quickly on a relatively small dataset of 4 sounds. The STFT method resulted in a much larger input matrix and therefore much longer times to converge on a solution.
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite... - Data Con LA
Online decision making over time requires interacting with an ever-changing environment, and the underlying machine learning models need to change and adapt to that environment. This talk discusses a class of machine learning algorithms and provides details of how the computation is parallelized using the Spark framework.
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa... - Data Con LA
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production nearly 2 years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experiences and challenges of running Heron at scale and the approaches taken to solve those challenges.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)... - Data Con LA
Alluxio, formerly Tachyon, is a memory speed virtual distributed storage system. The Alluxio open source community is one of the fastest growing open source communities in big data history with more than 300 developers from over 100 organizations around the world. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features including tiered storage, transparent naming, and unified namespace. Alluxio now supports a wide range of under storage systems, including Amazon S3, Google Cloud Storage, Gluster, Ceph, HDFS, NFS, and OpenStack Swift. This year, our goal is to make Alluxio accessible to an even wider set of users, through our focus on security, new language bindings, and further increased stability.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s... - Data Con LA
This talk draws on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to that task, and presents current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter... - Data Con LA
This talk explores the new features of MongoDB 3.2, such as $lookup, document validation rules and encryption at rest, and tools like the BI Connector, Ops Manager 2.0 and Compass.
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho... - Data Con LA
Netflix will spend six billion dollars this year on content, making the company a major player in Hollywood. An increasing portion of this spend will be on original shows such as House of Cards, and original movies such as Beasts of No Nation. As we continue to expand our involvement with Hollywood, we want to leverage data and data science to make the best decisions possible. This talk will explore areas where we see the most opportunity to apply data science to Hollywood, and some early approaches we've taken.
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ... - Data Con LA
The application of machine learning to problems such as script and story analysis, audience segmentation, and security is revolutionizing the way Hollywood creates and markets entertainment.
In memory computing principles by Mac Moore of GridGain - Data Con LA
In this presentation, we will provide an overview of general in-memory computing principles and the drivers behind them. We will start with a summary of the technical drivers (abundant hardware resources) and market forces (the rise of Big Data). We will cover popular and emerging use cases for in-memory computing, from financial industry trading platforms to mobile payment processing, online advertising, online/mobile gaming back-ends and more. We will then present some foundational concepts and terminology, and discuss considerations around any in-memory solution. From there, we will illustrate how a complete in-memory computing stack like GridGain combines clustering, high performance computing, in-memory data grids, stream processing and Hadoop acceleration into one unified and easy-to-use platform.
SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab - Chester Chen
Topic: Introduction to Succinct by UC Berkeley AmpLab.
"Cloud services today need to perform fast, interactive queries on large data volumes. Several recent studies have shown that data is growing faster than memory capacity, making in-memory query execution increasingly challenging. At UC Berkeley, we have built Succinct, a distributed data store that overcomes this problem by enabling a wide range of interactive queries (e.g., search, random access, range queries, and even regular expressions) directly on compressed data. Besides its ability to execute queries on compressed data, Succinct differs from existing data stores along several dimensions. First, Succinct unifies several powerful data models (key-value stores, document stores, tables, etc.) using a single interface. Second, Succinct enables applications to choose a desired compression factor, allowing applications to use larger memory for improved performance. Finally, Succinct allows applications to change the compression factor on the fly, enabling new approaches to handling skewed query distributions, time-varying loads, and failure tolerance. In this talk, I will describe Succinct's design, implementation and semantics. Succinct is completely open-sourced, and we have also released Succinct as a library that simplifies integration of Succinct data structures and techniques with existing data stores."
Speaker bio:
"Anurag is a graduate student at AMPLab, UC Berkeley, where he is advised by Prof. Ion Stoica. He co-created Succinct with Rachit Agarwal and Ion Stoica."
You can find more information about the project here: http://succinct.cs.berkeley.edu/wp/wordpress/?p=143
Denys Kovalenko "Scaling Data Science at Bolt" - Fwdays
Data has always been crucial to the growth of Bolt. One of the fastest-growing European companies, it challenges US giants whose engineering teams are orders of magnitude larger. In this talk, I'll share how developing our Data Science Platform enables us to grow fast and makes Machine Learning in production accessible and reliable.
website: https://fwdays.com/en/event/data-science-fwdays-2019/review/scaling-data-science-at-bolt
Building a social network in under 4 weeks with Serverless and GraphQL - Yan Cui
Serverless technologies drastically simplify the task of building modern, scalable APIs in the cloud, and GraphQL makes it easy for frontend teams to consume these APIs and to iterate quickly on your product idea. Together, they are a perfect combination for a product-focused, full-stack team to deliver customer value quickly.
In this talk, see how we built a new social network mobile app in under 4 weeks using Lambda, AppSync, DynamoDB and Algolia, how we approached CI/CD, testing and authentication, and the lessons we learnt along the way.
Real-world serverless podcast: https://realworldserverless.com
Learn Lambda best practices: https://lambdabestpractice.com
Blog: https://theburningmonk.com
Consulting services: https://theburningmonk.com/hire-me
Production-Ready Serverless workshop: https://productionreadyserverless.com
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big... - Grokking VN
- Speaker: Hervé Vũ Roussel - CEO & Co-founder @ QuodAI
- About the speaker: Hervé Vũ Roussel was previously the CTO of a software company in Silicon Valley, USA. He has been, and still is, an advisor and mentor to many organizations such as IBM AI XPRIZE, PlatoHQ (YC'16), RMIT, AngelHack, ... He is also a regular speaker on AI and software engineering, and has advised many universities and companies on computer science and software engineering training programs. Hervé is currently the CEO of Quod AI, a platform that helps explain source code in natural language.
In this talk he will share his experience designing a high-load, easily extensible (highly scalable) architecture for AI platforms, including:
- Foundational principles of building software architecture
- How to choose data storage technologies
- Building asynchronous data pipelines
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf - Altinity Ltd
OSA Con 2022: Scaling your Pandas Analytics with Modin
Doris Lee - Ponder
Pandas is one of the most commonly used data science libraries in Python, with a convenient set of APIs for data cleaning, visualization, analysis, and exploration. However, despite its widespread adoption, Pandas suffers from severe scalability issues on large datasets. We developed the open-source project Modin, which is a fast, scalable drop-in replacement for pandas. Modin has been downloaded more than 4 million times and is used by leading data science teams, including Fortune 100 companies.
PredictionIO - Building Applications That Predict User Behavior Through Big D... - predictionio
Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Presented by PredictionIO at Big Data TechCon (Oct 17, 2013)
Data Con LA 2022 - Using Google trends data to build product recommendations - Data Con LA
Mike Limcaco, Analytics Specialist / Customer Engineer at Google
Measure trends in a particular topic or search term on Google Search across the US, down to the city level. Integrate these data signals into analytic pipelines to drive product, retail, and media (video, audio, digital content) recommendations tailored to your audience segment. We'll discuss how Google's unique datasets can be used with Google Cloud smart analytic services to process, enrich and surface the most relevant product or content that matches the ever-changing interests of your local customer segment.
Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day work of building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - Improving disaster response with machine learning - Data Con LA
Antje Barth, Principal Developer Advocate, AI/ML at AWS & Chris Fregly, Principal Engineer, AI & ML at AWS
The frequency and severity of natural disasters are increasing. In response, governments, businesses, nonprofits, and international organizations are placing more emphasis on disaster preparedness and response. Many organizations are accelerating their efforts to make their data publicly available for others to use. Repositories such as the Registry of Open Data on AWS and Humanitarian Data Exchange contain troves of data available for use by developers, data scientists, and machine learning practitioners. In this session, see how a community of developers came together through the AWS Disaster Response hackathon to build models to support natural disaster preparedness and response.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas - Data Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Data Con LA 2022 - Real world consumer segmentation - Data Con LA
Jaysen Gillespie, Head of Analytics and Data Science at RTB House
1. Shopkick has over 30M downloads, but the userbase is very heterogeneous. Anecdotal evidence indicated a wide variety of users for whom the app holds long-term appeal.
2. Marketing and other teams challenged Analytics to get beyond basic summary statistics and develop a holistic segmentation of the userbase.
3. Shopkick's data science team used SQL and python to gather data, clean data, and then perform a data-driven segmentation using a k-means algorithm.
4. Interpreting the results is more work -- and more fun -- than running the algo itself. We'll discuss how we transform from "segment 1", "segment 2", etc. to something that non-analytics users (Marketing, Operations, etc.) could actually benefit from.
5. So what? How did teams across Shopkick change their approach given what Analytics had discovered?
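The k-means step mentioned in point 3 can be sketched as follows. This is a minimal, standard-library illustration of Lloyd's algorithm and not Shopkick's actual pipeline: the two features (sessions per week, rewards redeemed), the sample values and the function names are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical users: (sessions per week, rewards redeemed).
users = [(1, 0), (2, 1), (1, 1), (20, 9), (22, 10), (21, 8)]
labels, centroids = kmeans(users, k=2)

for segment, centre in enumerate(centroids):
    size = labels.count(segment)
    print(f"segment {segment}: {size} users, centroid {centre}")
```

The printed centroids are the raw material for the interpretation step in point 4: a segment whose centroid shows many sessions and many redemptions might be given a descriptive name for Marketing, for instance, but that naming is a human judgment outside the algorithm.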
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo... - Data Con LA
Ravi Pillala, Chief Data Architect & Distinguished Engineer at Intuit
TurboTax is one of the best-known consumer software brands, and at its peak it serves 385K+ concurrent users. In this session, we start by looking at how user behavioral data and tax domain events are captured in real time using the event bus and analyzed to drive real-time personalization with various TurboTax data pipelines. We will also look at analytics solutions that make use of these events with the help of Kafka, Apache Flink, Apache Beam, Spark, Amazon S3, Amazon EMR, Redshift, Athena and AWS Lambda functions. Finally, we look at how SageMaker is used to create the TurboTax model that predicts whether a customer is at risk or needs help.
Data Con LA 2022 - Moving Data at Scale to AWS - Data Con LA
George Mansoor, Chief Information Systems Officer at California State University
Overview of the CSU data architecture for moving on-prem ERP data to the AWS Cloud at scale, using Delphix for data replication/virtualization and AWS Database Migration Service (DMS) for data extracts.
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI - Data Con LA
Anand Ranganathan, Chief AI Officer at Unscrambl
Conversational AI is getting more and more widely used for customer support and employee support use-cases. In this session, I'm going to talk about how it can be extended for data analysis and data science use-cases ... i.e., how users can interact with a bot to ask analytical questions on data in relational databases.
This allows users to explore complex datasets using a combination of text and voice questions, in natural language, and then get back results in a combination of natural language and visualizations. Furthermore, it allows collaborative exploration of data by a group of users in a channel in platforms like Microsoft Teams, Slack or Google Chat.
For example, a group of users in a channel can ask questions to a bot in plain English like "How many cases of Covid were there in the last 2 months by state and gender" or "Why did the number of deaths from Covid increase in May 2022", and jointly look at the results that come back. This facilitates data awareness, data-driven collaboration and joint decision making among teams in enterprises and outside.
In this talk, I'll describe how we can bring together various features including natural-language understanding, NL-to-SQL translation, dialog management, data story-telling, semantic modeling of data and augmented analytics to facilitate collaborative exploration of data using conversational AI.
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ... - Data Con LA
Anil Inamdar, VP & Head of Data Solutions at Instaclustr
The most modernized enterprises utilize polyglot architecture, applying the best-suited database technologies to each of their organization's particular use cases. To successfully implement such an architecture, though, you need a thorough knowledge of the expansive NoSQL data technologies now available.
Attendees of this Data Con LA presentation will come away with:
-- A solid understanding of the decision-making process that should go into vetting NoSQL technologies and how to plan out their data modernization initiatives and migrations.
-- They will learn the types of functionality that best match the strengths of NoSQL key-value stores, graph databases, columnar databases, document-type databases, time-series databases, and more.
-- Attendees will also understand how to navigate database technology licensing concerns, and to recognize the types of vendors they'll encounter across the NoSQL ecosystem. This includes sniffing out open-core vendors that may advertise as “open source,"" but are driven by a business model that hinges on achieving proprietary lock-in.
-- Attendees will also learn to determine if vendors offer open-code solutions that apply restrictive licensing, or if they support true open source technologies like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more that offer total portability and true freedom of use.
Data Con LA 2022 - Intro to Data ScienceData Con LA
Zia Khan, Computer Systems Analyst and Data Scientist at LearningFuze
Data Science tutorial is designed for people who are new to Data Science. This is a beginner level session so no prior coding or technical knowledge is required. Just bring your laptop with WiFi capability. The session starts with a review of what is data science, the amount of data we generate and how companies are using that data to get insight. We will pick a business use case, define the data science process, followed by hands-on lab using python and Jupyter notebook. During the hands-on portion we will work with pandas, numpy, matplotlib and sklearn modules and use a machine learning algorithm to approach the business use case.
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
Mariana Danilovic, Managing Director at Infiom, LLC
We will address:
(1) Community creation and engagement using tokens and NFTs
(2) Organization of DAO structures and ways to incentivize Web3 communities
(3) DeFi business models applied to Web3 ventures
(4) Why Metaverse matters for new entertainment and community engagement models.
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
Curtis ODell, Global Director Data Integrity at Tricentis
Join me to learn about a new end-to-end data testing approach designed for modern data pipelines that fills dangerous gaps left by traditional data management tools—one designed to handle structured and unstructured data from any source. You'll hear how you can use unique automation technology to reach up to 90 percent test coverage rates and deliver trustworthy analytical and operational data at scale. Several real world use cases from major banks/finance, insurance, health analytics, and Snowflake examples will be presented.
Key Learning Objective
1. Data journeys are complex and you have to ensure integrity of the data end to end across this journey from source to end reporting for compliance
2. Data Management tools do not test data, they profile and monitor at best, and leave serious gaps in your data testing coverage
3. Automation with integration to DevOps and DataOps' CI/CD processes are key to solving this.
4. How this approach has impact in your vertical
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
Arif Ansari, Professor at University of Southern California
Super Bowl Ad cost $7 million and each year a few Super Bowl ads go viral. The traditional A/B testing does not predict virality. Some highly shared ones reach over 60 million organic views, which can be more valuable than views on TV. Not only are these voluntary, but they are typically without distraction, and win viewer engagement in the form of likes, comments, or shares. A Super Bowl ad that wins 69 million views on YouTube (e.g., Alexa Mind Reader) costs less than 10 cents per quality view! However, the challenge is triggering virality. We developed a method to predict virality and engineer virality into Ads.
1. Prof. Gerard J. Tellis and co-authors recommended that advertisers use YouTube to tease, test, and tweak (TTT) their ads to maximize sharing and viewing. 2022 saw that maxim put into practice.
2. We developed viral Ads prediction using two scientific models:
a. Prof. Gerard Tellis et al.'s model for viral prediction
b. Deep Learning viral prediction using social media effect
3. The model was able to identify all the top 15 Viral Ads it performed better than the traditional agencies.
4. New proposed method is Tease, Test, Tweak, Target and Spots Ad.
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
Jai Bansal, Senior Manager, Data Science at Aetna
This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.
To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.
We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.
This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.
We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.
Data Con LA 2022 - Data Streaming with KafkaData Con LA
Jie Chen, Manager Advisory, KPMG
Data is the new oil. However, many organizations have fragmented data in siloed line of businesses. In this topic, we will focus on identifying the legacy patterns and their limitations and introducing the new patterns packed by Kafka's core design ideas. The goal is to tirelessly pursue better solutions for organizations to overcome the bottleneck in data pipelines and modernize the digital assets for ready to scale their businesses. In summary, we will walk through three uses cases, recommend Dos and Donts, Take aways for Data Engineers, Data Scientist, Data architect in developing forefront data oriented skills.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
1. Don’t Reinvent the Big-Data Wheel!
Clint Kelly - @clintwkelly
WibiData
Building real-time, Big Data applications on Cassandra with the open-source Kiji project
Big Data Camp LA
14 June 2014
98-121. How does it work?
[Architecture diagram, built up one element at a time across slides 98 to 121. Labels on slide 98: KijiSchema (Cassandra); Data Science; Engineering; Channels; Write; Read; KijiREST; Stream; User 1, User 2, User 3; Query; KijiHive; three "C" markers.]
[Elements added on later slides: Data (99); KijiMR (103); KijiExpress (104); Scorer (105); "R" markers (107, 110); KijiScoring and Kiji Model Repository (111); further "R" markers (119, 120, 121).]
147. Timestamped versions
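[Figure: one Kiji table row, labelled 0xfa “bob”, drawn as a grid of cells. Columns shown: songs:let it be, inter:search, inter:clicks, payment:cardnum, payment:address, rec:scorer1, rec:scorer2, rec:scorer3. The songs:let it be and rec:scorer3 cells repeat, which suggests stacked versions; two example timestamps are marked, 1396560123 and 1395650231.]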
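A minimal sketch of the timestamped-versions idea, using a plain in-memory Python structure rather than Kiji or Cassandra. The class name `VersionedRow`, its methods, and the cell values ("track-17", "track-42") are invented for illustration; only the column name inter:clicks and the two timestamps come from the slide.

```python
from collections import defaultdict


class VersionedRow:
    """One row whose cells keep several timestamped versions (hypothetical helper)."""

    def __init__(self):
        # (family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, family, qualifier, timestamp, value):
        # Writing with a new timestamp adds a version instead of overwriting.
        self._cells[(family, qualifier)][timestamp] = value

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._cells.get((family, qualifier), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


row = VersionedRow()
row.put("inter", "clicks", 1395650231, "track-17")  # sample value, not from the slide
row.put("inter", "clicks", 1396560123, "track-42")  # sample value, not from the slide

latest = row.get("inter", "clicks")                    # newest version only
history = row.get("inter", "clicks", max_versions=10)  # every stored version
```

Reading with `max_versions=1` gives the usual "current value" view, while a larger limit exposes the history kept in the same cell.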
148. Complex data types
record Search {
string search_term;
long session_id;
device_type device;
}
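The record above resembles Avro IDL. As a rough sketch of how such a record could become the bytes stored in a cell, the Python below hand-rolls Avro-style binary encoding (zig-zag varints for longs and enum indexes, length-prefixed UTF-8 for strings). This is an assumption about the encoding, not a statement of what Kiji does internally; a real application would normally use an Avro library. The `DeviceType` symbols are hypothetical, since the slide does not define `device_type`.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceType(Enum):
    # Hypothetical symbols; the slide does not list them.
    PHONE = 0
    TABLET = 1
    DESKTOP = 2


@dataclass
class Search:
    search_term: str
    session_id: int
    device: DeviceType


def encode_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit value, then emit it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_search(rec: Search) -> bytes:
    """Concatenate the field encodings in declaration order."""
    return (
        encode_string(rec.search_term)
        + encode_long(rec.session_id)
        + encode_long(rec.device.value)  # enum written as its zero-based index
    )


blob = encode_search(Search("beatles", 7, DeviceType.TABLET))
```

The resulting `blob` is an opaque byte string, so the storage layer does not need to know the record's fields.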
161. Locality group ➔ Column family
CREATE TABLE loc_grp
162. Entity ID ➔ Primary key
CREATE TABLE loc_grp (city text, user text,
PRIMARY KEY (city, user) )
WITH CLUSTERING ORDER BY (user ASC);
163. Family, Qualifier,Version ➔ Clustering Columns
CREATE TABLE loc_grp (city text, user text,
family text, qualifier text, version bigint,
PRIMARY KEY (city, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC,
version DESC);
164. Column values ➔ Blobs
CREATE TABLE loc_grp (city text, user text,
family text, qualifier text, version bigint, value blob,
PRIMARY KEY (city, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC,
version DESC);
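To see how this layout behaves on reads, here is a small sketch that uses SQLite from the Python standard library as a stand-in for Cassandra. SQLite has no clustering order, so the `ORDER BY version DESC` in the query plays the role that `WITH CLUSTERING ORDER BY` plays in the CQL above. The city value "LA", the one-byte blobs, and the helper `read_cell` are invented sample data and names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE loc_grp (
        city text, user text,
        family text, qualifier text, version bigint, value blob,
        PRIMARY KEY (city, user, family, qualifier, version)
    )
    """
)

# Two versions of inter:clicks and one version of songs:"let it be".
rows = [
    ("LA", "bob", "inter", "clicks", 1395650231, b"\x01"),
    ("LA", "bob", "inter", "clicks", 1396560123, b"\x02"),
    ("LA", "bob", "songs", "let it be", 1396560123, b"\x03"),
]
conn.executemany("INSERT INTO loc_grp VALUES (?, ?, ?, ?, ?, ?)", rows)


def read_cell(city, user, family, qualifier, max_versions=1):
    """Return up to max_versions (version, value) pairs for one column, newest first."""
    cur = conn.execute(
        "SELECT version, value FROM loc_grp "
        "WHERE city = ? AND user = ? AND family = ? AND qualifier = ? "
        "ORDER BY version DESC LIMIT ?",
        (city, user, family, qualifier, max_versions),
    )
    return cur.fetchall()


newest = read_cell("LA", "bob", "inter", "clicks")
all_versions = read_cell("LA", "bob", "inter", "clicks", max_versions=10)
```

Because family, qualifier, and version are all part of the primary key, every version of every column is its own row, and "latest value" is simply the first row in descending version order.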
180. Operations across locality groups
Kiji locality group ➔ C* column family
Read across locality groups
➔ multiple C* reads (async API!)
Compare-and-set across locality groups
➔ not allowed in C* Kiji
Lose transactional consistency
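The read path above can be sketched with Python's asyncio standing in for whatever async driver API the talk refers to. The table names `lg_interactions` and `lg_payment`, the functions `read_table` and `read_row`, and the stored bytes are all hypothetical; the point is only that one logical row read fans out into one non-blocking read per locality group.

```python
import asyncio

# Hypothetical in-memory stand-ins for two Cassandra tables, one per locality group.
TABLES = {
    "lg_interactions": {("bob", "inter", "clicks"): b"\x02"},
    "lg_payment": {("bob", "payment", "address"): b"\x07"},
}


async def read_table(table, key):
    """Pretend to issue one non-blocking read against a single table."""
    await asyncio.sleep(0.01)  # simulated network round trip
    return TABLES[table].get(key)


async def read_row(requests):
    """Issue one read per locality group concurrently and gather the results."""
    tasks = [read_table(table, key) for table, key in requests]
    values = await asyncio.gather(*tasks)
    return dict(zip(requests, values))


requests = [
    ("lg_interactions", ("bob", "inter", "clicks")),
    ("lg_payment", ("bob", "payment", "address")),
]
result = asyncio.run(read_row(requests))
```

The two reads complete independently, so a writer updating both tables at the same moment could be visible in one result and not the other. That is the loss of transactional consistency noted on the slide, and it is also why a single compare-and-set cannot span locality groups in this design.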