This document discusses the high-level architecture for a data platform supporting a customer 360 view using data from connected vehicles (taxis). The architecture includes data sources, streaming data ingestion using Kafka, schema validation, stream processing for transformations and routing, and storage for analytics, search, and long-term retention. The presentation covers design considerations for reliability, scalability, and the processing of both streaming and batch data to meet requirements such as querying, visualization, and batch processing of historical data.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Architecting a next-generation data platform (hadooparchbook)
Slides for Architecting a next-generation data platform at Strata + Hadoop World, London 2017.
https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/57652
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data. Some of this data comes in the form of business transactions and is stored in a relational database. This relational data is often combined with other non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for us as data integration professionals is to then combine this data and transform it into something useful. Not just that, but we must also do it in near real-time and using a big data target system such as Hadoop. The topic of this session, real-time data streaming, provides us a great solution for that challenging task. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution. This session will walk through the implementation of GoldenGate and Kafka.
Presented at Collaborate16 in Las Vegas.
Hadoop YARN is the next-generation computing platform in Apache Hadoop, with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all problems wholly using the MapReduce programming model. Typical installations run separate programming models like MR, MPI, and graph-processing frameworks on individual clusters. Running fewer, larger clusters is cheaper than running more small clusters. Therefore, leveraging YARN to allow both MR and non-MR applications to run on top of a common cluster becomes important from an economical and operational point of view. This talk will cover the different APIs and RPC protocols that are available for developers to implement new application frameworks on top of YARN. We will also go through a simple application which demonstrates how one can implement their own ApplicationMaster, schedule requests to the YARN ResourceManager, and then subsequently use the allocated resources to run user code on the NodeManagers.
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk shares experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming, and database services. The talk is aimed at developers, DBAs, service managers, and members of the Spark community who are using and/or investigating “Big Data” solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
Hadoop Security and Compliance - StampedeCon 2016 (StampedeCon)
As Hadoop becomes a mainstream data platform across organizations, securing a vast and growing volume of critical information, especially financial and healthcare data, is more essential than ever. In this presentation, Derek will elaborate on how to leverage Big Data technologies without sacrificing security and compliance, focusing especially on how comprehensive security mechanisms should be put in place to secure a production-ready Hadoop environment. The presentation will also highlight technologies such as encryption in motion and at rest for Hadoop services, as well as the compliance processes required to meet the strictest regulatory requirements and standards.
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro (Cloudera, Inc.)
Trend Micro developed the new security features in HBase 0.92 and has the first known deployment of secure HBase in production. We will share our motivations, use cases, experiences, and provide a 10 minute tutorial on how to set up a test secure HBase cluster and a walk through of a simple usage example. The tutorial will be carried out live on an on-demand EC2 cluster, with a video backup in case of network or EC2 unavailability.
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors (Michael Rainey)
Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.
Presented at the Oracle Technology Network Virtual Technology Summit February/March 2015.
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ... (Data Con LA)
NoSQL has exploded on the developer scene, promising alternatives to RDBMS that make rapidly developing Internet-scale applications easier than ever. However, as a trade-off for that ease of development and scale, some of the familiarity of well-known query interfaces such as SQL has been lost. Until now, that is... N1QL (pronounced “nickel”) is a SQL-like query language for querying JSON, which brings the familiarity of RDBMS back to the NoSQL world. In this session you will learn the syntax and basics of this new language as well as its integration with the Couchbase SDKs.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
2015 nov 27_thug_paytm_rt_ingest_brief_final (Adam Muise)
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system with Sqoop to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature creation template.
Architecting a Next Gen Data Platform – Strata New York 2018 (Jonathan Seidman)
Using Customer 360 and the internet of things as examples, this tutorial explains how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting a next-generation big data platform, by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks (Databricks)
The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software.
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ... (Databricks)
One of the biggest challenges customers face is how to productionize machine learning in the enterprise. Once the data scientists, data engineers, business analysts, and machine learning engineers have successfully built their machine learning models, they need model management: a system that manages and orchestrates the entire lifecycle of machine learning models.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Delivering Insights from 20M+ Smart Homes with 500M+ Devices (Databricks)
We started out processing big data using AWS S3, EMR clusters, and Athena to serve Analytics data extracts to Tableau BI.
However as our data and teams sizes increased, Avro schemas from source data evolved, and we attempted to serve analytics data through Web apps, we hit a number of limitations in the AWS EMR, Glue/Athena approach.
This is a story of how we scaled out our data processing and boosted team productivity to meet our current demand for insights from 20M+ Smart Homes and 500M+ devices across the globe, from numerous internal business teams and our 150+ CSP partners.
We will describe lessons learnt and best practices established as we enabled our teams with Databricks autoscaling job clusters and notebooks, and migrated our Avro/Parquet data to use the Metastore, SQL Endpoints and SQLA Console, while charting the path to the Delta Lake…
Stargate, the gateway for some multi-models data API (Data Con LA)
Data Con LA 2020
Description
Join us to learn about Stargate! Stargate is a data gateway deployed between client applications and a database. It's built with extensibility as a first-class citizen and makes it easy to use a database for any application workload by adding plugin support for new APIs, data types, and access methods. After detailing the architecture and ideas behind the framework, we will demo the creation of REST and GraphQL APIs on top of Cassandra through simple configuration. Bring home a working sample!
Speaker
Cedrick Lunven, Director of Developer Advocacy, Datastax
3 Things to Learn:
-How data is driving digital transformation to help businesses innovate rapidly
-How Choice Hotels (one of largest hoteliers) is using Cloudera Enterprise to gain meaningful insights that drive their business
-How Choice Hotels has transformed business through innovative use of Apache Hadoop, Cloudera Enterprise, and deployment in the cloud — from developing customer experiences to meeting IT compliance requirements
GSJUG: Mastering Data Streaming Pipelines 09May2023 (Timothy Spann)
GSJUG: Mastering Data Streaming Pipelines 09May2023
https://www.meetup.com/futureofdata-princeton/events/293233881/
This is a repost from the Garden State Java Users Group Event.
Join me at
https://www.meetup.com/garden-state-java-user-group/events/293229660/
See: https://www.eventbrite.com/e/mastering-data-streaming-pipelines-tickets-627677218457?_ga=2.253257801.1787151623.1682868226-741104479.1678110925
Please note that registration via EventBrite is required to attend either in-person or online.
We are happy to announce that Tim Spann will be our special guest for the May 9, 2023 meeting!
Abstract:
In this session, Tim will show you some best practices that he has discovered over the last seven years in building data streaming applications including IoT, CDC, Logs, and more.
In his modern approach, we utilize several Apache frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink, enhance events with NiFi enrichment. We build continuous queries against our topics with Flink SQL.
We will show where Java fits in as sources, enrichments, NiFi processors and sinks.
We hope to see you on May 9!
Speaker
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka (Guido Schmutz)
Apache Kafka is a popular distributed streaming data platform and, more and more, the architectural backbone for integrating streaming data with data lakes, microservices, and stream processing. A lot of the data needed in stream processing is stored in traditional systems backed by relational databases. This session will present different approaches for integrating relational databases with Kafka, such as Kafka Connect, Oracle GoldenGate, ORDS APIs, and bridging Kafka with Oracle AQ.
AWS re:Invent 2016 was AWS’ largest event yet with over 32,000 attendees, 400 breakout sessions, and two keynotes of new product announcements. In this talk, we’ll explore the core themes of AWS re:Invent 2016 such as serverless and artificial intelligence. We will also drill down into several of the services and features unveiled including AWS Batch, AWS Shield, Aurora for Postgres, X-Ray, Polly, Lex, Rekognition, AWS Step Functions. Light appetizers and refreshments will be provided.
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012 (Jonathan Seidman)
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Extending the Data Warehouse with Hadoop - Hadoop World 2011 (Jonathan Seidman)
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Learn SQL from basic queries to Advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
4. About the presenters (questions: tiny.cloudera.com/sgquestions)
Ted Malaska
▪ BNet Group Architect at Blizzard
▪ Cloudera Principal Solution Architect
▪ Architect at FINRA
▪ Contributor to: Apache Spark, Hadoop, Hive, Sqoop, YARN, Flume, etc.
5. About the presenters
Jonathan Seidman
▪ Software Engineer at Cloudera
▪ Contributor to Apache Sqoop
▪ Previously Technical Lead on the big data team at Orbitz
▪ Co-founder of the Chicago Hadoop User Group and Chicago Big Data
20. Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
- Reliable and scalable storage to support modeling and processing of multiple data formats.
24. High level architecture
[Architecture diagram: Data Sources feed Producers (code, agents, log aggregators) into the Streaming Pipes (transport, replication), then Stream Processing (schema validation, enrichment, routing), then Storage (long-term SQL, speed-layer SQL, time series, reverse-indexed, stream), and finally Access (SQL, machine learning, request/response, batch processing).]
25. Key to Customer 360 Success
Your project is only as good as the quality and variety of your data sources.
[Diagram of example sources: streaming vehicle data (MQTT), geo-location/traffic data, customer data, maintenance data, and other data sources: files (CSV? XML? JSON?), Twitter? Mainframe? Database? Salesforce?]
26. High Level Architecture
[The same architecture diagram as slide 24: Data Sources → Producers → Streaming Pipes → Stream Processing → Storage → Access.]
32. REST Proxy
Talking to non-native Kafka apps and outside the firewall
[Diagram: non-Java applications reach the Kafka cluster over REST/HTTP via the REST Proxy, while native Kafka Java applications connect directly.]
▪ Provides a RESTful interface to a Kafka cluster
▪ Simplifies message creation and consumption
▪ Simplifies administrative actions
33. Kafka Connect
Streaming data capture
[Diagram: sources such as JDBC, logs, MQTT, RDBMSs, and key/value stores flow through the Kafka Connect API into Kafka, and out through sinks such as HDFS.]
▪ Fault tolerant
▪ Manages hundreds of data sources and sinks
▪ Preserves data schema
▪ Part of the Apache Kafka project
▪ Includes simple transformations
41. Goals for our Transport Layer
▪ To meet these goals we want some kind of publish-subscribe queue:
- Kafka
- Kinesis
- RabbitMQ
- Azure Queues
- Azure Service Bus
- Google Pub/Sub
- etc…
43. Buffering Data
▪ What do we mean by “buffering” and why do we need it?
[Diagram: an unbuffered firehose of event,event,event,event,event,event… straight into the consumer – this is bad!]
▪ Network partitions happen
▪ Producers and consumers work at different rates
▪ Reliable storage is hard; stream processing is hard – let’s solve one at a time
45. What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or a “streaming data platform”
[Diagram: a data source appends messages to an ordered log (offsets 0–8); data consumers A and B each read from their own position.]
46. Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
[Diagram: a data source writes to a topic with partitions 0, 1, and 2, each an ordered log with offsets 0–8.]
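To make the keyed-write behavior concrete, here is a minimal Scala producer sketch (not from the deck; the broker address is a placeholder, and the taxi-trip-input topic name is borrowed from a later slide). Records sharing a key, such as a taxi ID, hash to the same partition, which is exactly what preserves per-key time ordering:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TaxiTripProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092") // placeholder broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // The key (taxi ID) determines the partition, so each taxi's events stay in order.
  producer.send(new ProducerRecord("taxi-trip-input", "taxi-42", """{"lat":40.7,"lon":-74.0}"""))
  producer.close()
}
```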
50. Kafka Considerations – Reliability
▪ Different reliability levels for different topics:
- Taxi trip data → taxi-trip-input: 100%, duplicates are OK (“at least once”)
- Twitter → customer-sentiment: <=100% (“at most once”)
▪ News flash: Kafka’s exactly-once producer is on the way
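A sketch of how those two reliability levels map onto standard producer settings (acks and retries are real Kafka producer configs; the values are illustrative):

```scala
import java.util.Properties

// “At least once”: wait for all in-sync replicas and retry on failure.
// Duplicates are possible when a retried send had actually succeeded.
val atLeastOnce = new Properties()
atLeastOnce.put("acks", "all")
atLeastOnce.put("retries", "3")

// “At most once”: fire and forget. No retries, so messages can be lost
// but never duplicated.
val atMostOnce = new Properties()
atMostOnce.put("acks", "0")
atMostOnce.put("retries", "0")
```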
63. How many partitions?
▪ Adding partitions late in the game is painful
▪ Basic formula: total desired throughput / throughput of the slowest consumer or producer
▪ Or size for roughly ~25GB of disk space per partition
▪ Not too many, because:
- Each partition takes broker heap memory and file handles
- Each partition slows down node shutdown/recovery
- 1000–4000 partitions per broker max
- Producers will produce smaller batches per partition – lower throughput
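A worked instance of the basic formula, with made-up throughput numbers:

```scala
// Hypothetical figures, just to exercise the formula.
val desiredThroughputMBs = 200.0 // total MB/s the topic must sustain
val slowestConsumerMBs   = 25.0  // MB/s the slowest consumer can absorb

val partitions = math.ceil(desiredThroughputMBs / slowestConsumerMBs).toInt
println(s"At least $partitions partitions") // prints: At least 8 partitions
```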
65. Guarding Against Message Loss
▪ Producer – what happens if the producer loses its connection to Kafka and the buffer overflows?
- You get an exception. You can choose to… block? Write to a file?
▪ Source – what happens if events are lost before getting sent to the producer?
- Once again, use some kind of buffer to provide sufficient retention of data.
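One minimal way to surface producer-side failures is the standard send callback; the spill-to-file fallback below is just one of the options the slide hints at (the file name is a placeholder):

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

def sendWithFallback(producer: KafkaProducer[String, String],
                     record: ProducerRecord[String, String]): Unit =
  producer.send(record, new Callback {
    override def onCompletion(meta: RecordMetadata, e: Exception): Unit =
      if (e != null) {
        // Send failed (buffer overflow, broker unreachable, …):
        // spill the event to a local file so it can be replayed later.
        Files.write(Paths.get("failed-events.log"),
          (record.value() + "\n").getBytes,
          StandardOpenOption.CREATE, StandardOpenOption.APPEND)
      }
  })
```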
70. What do we mean by streaming?
- Real-time: constant low milliseconds and under
- Near real-time: low milliseconds to seconds, delay in case of failures
- Batch: 10s of seconds or more, re-run in case of failures
72. But, there’s no free lunch
The same latency spectrum, with the trade-off:
- Real-time: constant low milliseconds and under (“difficult” architectures, lower latency)
- Near real-time: low milliseconds to seconds, delay in case of failures
- Batch: 10s of seconds or more, re-run in case of failures (“easier” architectures, higher latency)
87. Ingestion
- File systems & object stores
- Normally you want larger files
- Normally you want high compression
- Normally you want deduping
- There is a window of deduping; 100% deduping in this case is difficult
- Think about sequence numbering from the source
89. Ingestion
- NoSQL time series
- Dumb inserts
- You may need aggregation across metrics
- To increase throughput you may want to buffer writes
- Context: a row is a key plus time/data points. Writing
  Row1 → Time1, DataPoint1
  Row1 → Time2, DataPoint2
  is slower than writing
  Row1 → Time1, DataPoint1, Time2, DataPoint2
91. Aggregation & Counting
- This is where we talk about Lambda
- There are many definitions
- Common but not correct: jobs that involve both batch & streaming
- Correct: a perfect count is not possible with streaming alone, so we use a combination of streaming and batch to show the right value
92. Aggregation & Counting
- Why is streaming not perfect? Consider a speed layer incrementing a NoSQL counter (starting value: 10):
  1. Get event to process
  2. Increment (+2): value is now 12
  3. Fail before acknowledging the event
  4. Get the same event again
  5. Increment (+2) again: value is now 14
  6. Acknowledge the event
  The counter reads 14 when the true count is 12: replayed increments double-count.
98. Aggregation & Counting
- The failure problem is solved by adding internal state to the micro-batch layer:
  1. Get batch Y
  2. Count batch Y
  3. updateStateByKey: merge the batch counts into state (state as of batch X: value = 10, after batch Y: value = 12 … 14)
  4. For each value, put the new state into NoSQL: Put(14)
  5. Acknowledge the batch
  On failure, state is reset to the start of the batch and the steps repeat; since the put writes the absolute state value (Put(14)) rather than an increment, a replayed batch writes the same value instead of double-counting.
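A minimal Spark Streaming sketch of that stateful pattern using updateStateByKey; the socket source is a stand-in, and the NoSQL put is stubbed with a print:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCount extends App {
  val ssc = new StreamingContext(new SparkConf().setAppName("counts"), Seconds(5))
  ssc.checkpoint("/tmp/state-checkpoint") // state survives micro-batch replays

  val events = ssc.socketTextStream("localhost", 9999).map(key => (key, 1L))

  // Merge each micro-batch's counts into running state; on replay the same
  // batch produces the same state, so the downstream put is idempotent.
  val totals = events.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + batch.sum)
  }

  totals.foreachRDD(_.foreach { case (k, v) => println(s"put($k, $v)") }) // stand-in for a NoSQL put
  ssc.start(); ssc.awaitTermination()
}
```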
100. Aggregation & Counting
- Deduping with sequence numbers. Incoming events (note the duplicate B/2):

  Source | Sequence | Value
  A      | 1        | 10
  B      | 1        | 100
  A      | 2        | 10
  B      | 2        | 100
  A      | 3        | 10
  B      | 2        | 100  (duplicate)
  B      | 3        | 100

  Running aggregate state after each event:

  Seq of A | Value of A | Seq of B | Value of B
  1        | 10         | -        | -
  1        | 10         | 1        | 100
  2        | 20         | 1        | 100
  2        | 20         | 2        | 200
  3        | 30         | 2        | 200
  3        | 30         | 2        | 200  (duplicate B/2 ignored)
  3        | 30         | 3        | 300
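The dedup rule is easy to state in code: per source, an event is applied only if its sequence number is higher than the last one seen. A plain-Scala sketch over the events above:

```scala
final case class Event(source: String, seq: Long, value: Long)

// Per-source state: (highest sequence seen, running sum).
def applyEvent(state: Map[String, (Long, Long)], e: Event): Map[String, (Long, Long)] = {
  val (lastSeq, sum) = state.getOrElse(e.source, (0L, 0L))
  if (e.seq > lastSeq) state.updated(e.source, (e.seq, sum + e.value))
  else state // replayed or duplicate sequence number: ignore
}

val events = Seq(
  Event("A", 1, 10), Event("B", 1, 100), Event("A", 2, 10),
  Event("B", 2, 100), Event("A", 3, 10), Event("B", 2, 100), Event("B", 3, 100))

val finalState = events.foldLeft(Map.empty[String, (Long, Long)])(applyEvent)
println(finalState) // Map(A -> (3,30), B -> (3,300)) – the duplicate B/2 was skipped
```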
111. Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve
120. Spark Streaming - Gaps
▪ Latency is not as low
- There are efforts toward reducing latency, e.g. RISElab’s Drizzle
▪ Globally consistent execution state requires:
- Stopping the overall execution of the distributed computation
- Eagerly persisting records in transit, meaning larger snapshots
121. Flink
▪ True “streaming” system, but not as feature-rich as Spark
▪ Much better event-time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower latency
- No micro-batching
- Asynchronous Barrier Snapshotting (ABS)
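A tiny sketch in Flink's (pre-1.15) Scala API showing the record-at-a-time keyed processing that the slide contrasts with micro-batching; the elements are made up:

```scala
import org.apache.flink.streaming.api.scala._

object KeyedCount extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Each record flows through individually (no micro-batches);
  // the running sum is kept as state per key.
  env.fromElements(("taxi-1", 1), ("taxi-2", 1), ("taxi-1", 1))
    .keyBy(_._1)
    .sum(1)
    .print()

  env.execute("keyed-count")
}
```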
129. Kafka Streams
▪ Good integration with Kafka
▪ Light-weight library (not a framework)
▪ No micro-batching, uses Kafka as internal messaging layer
▪ Maintains local state per node (in RocksDB, or in memory hash map)
▪ Handles late events
▪ Stream-to-stream joins
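A minimal Kafka Streams sketch (the Java API called from Scala) of a per-key running count, with state kept in a local RocksDB store; the application ID, broker address, and topic names are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.Produced

object StreamingCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "taxi-counts")        // placeholder
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")    // placeholder
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  // Count events per key; state lives in a local RocksDB store,
  // backed by a Kafka changelog topic for fault tolerance.
  val counts = builder.stream[String, String]("taxi-trip-input").groupByKey().count()
  counts.toStream.to("taxi-trip-counts", Produced.`with`(Serdes.String(), Serdes.Long()))

  new KafkaStreams(builder.build(), props).start()
}
```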
138. Basics of GFS => HDFS
- NameNode
- Holds the metadata for all files/blocks
- Knows which data nodes each block is assigned to
- Manages replication
- DataNodes
- Hold metadata for each block location on disk
139. Basics of GFS => HDFS
[Diagram: Client, NameNode, and DataNodes A, B, C.]
Write path:
A. Client asks the NameNode for a location to write
B. Client writes to a DataNode per the NameNode’s instructions
C. The DataNode handles replication to the other DataNodes
D. The client gets confirmation once the file is persisted
140. Basics of GFS => HDFS
- Files are immutable
- Files can be of any type
- Files are broken up into blocks (128MB -> 1GB)
- Metadata cost is per block, not per data size
- A file may or may not be splittable for reading
142. Object Store (Like and not like HDFS)
- Like HDFS:
- Contains files
- Breaks up large files
- Not like HDFS:
- Not really a file system; more key/value, like a NoSQL store
- Doesn’t have the metadata limit problem
- Traversing folder directories is more work
- There is no rename, only copy and delete
- Eventual-consistency issues with listing files (seen with things like MR and Spark); can be mostly addressed with EMRFS
144. Object Store (Thinking Remote)
- Unlike HDFS, the storage is always remote
- Not on the same nodes as the execution
- This lets you save money in the cloud: execution nodes are expensive compared to storage-only nodes
- The network will be used for reads and writes
- In fact, you are normally throttled well before the network limit of your node
- You will want the highest compression rates possible
- To save money on storage
- To read and write faster
148. Compression Codecs
- Snappy: 2x-3x; fast read, fast write
- LZO: 2x-3x; fast read, fast write
- Gzip: ~8x; fairly fast read, normal write
- Default (deflate): ~8x; fairly fast read, normal write
- BZip2: ~10x; fairly fast read, slow write
- Others…
- Always be skeptical: all data compresses differently, so use your own data
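One easy way to follow the "use your own data" advice is to write the same dataset with several codecs and compare sizes on disk; a Spark sketch, with a placeholder input path:

```scala
import org.apache.spark.sql.SparkSession

object CodecCheck extends App {
  val spark = SparkSession.builder.appName("codec-check").getOrCreate()
  val df = spark.read.json("/data/taxi-trips") // placeholder input path

  // Write the same data with different codecs and compare the output sizes.
  for (codec <- Seq("snappy", "gzip", "none"))
    df.write
      .option("compression", codec)
      .parquet(s"/tmp/taxi-trips-$codec")
}
```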
149. Introducing the Hive Metastore
- Hive Metastore
- Adds a table-like metadata layer over a file system, block store, NoSQL store, or other storage
- Allows for SQL access
- Allows for greater security options
- Allows for external metadata
- Allows for partitioning
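For instance, registering a partitioned, Parquet-backed table over existing files via Spark SQL (the path and columns are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("metastore-demo")
  .enableHiveSupport() // use the Hive Metastore for table metadata
  .getOrCreate()

// The metastore stores the schema and partition layout, not the data itself.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS taxi_trips (
    taxi_id STRING,
    lat DOUBLE,
    lon DOUBLE
  )
  PARTITIONED BY (trip_date STRING)
  STORED AS PARQUET
  LOCATION '/data/taxi-trips'
""")
```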
153. Thinking about Object/Tables
1. Let’s start off easy.
Use case: we are a Netflix-type company and we have a log of users and movies watched that looks something like this:

User ID | Age | Account Start Date | Category of User | Movie Watched | Movie Category | Start Time | Events List
Bob | 42 | 12/12/2012 | Basic | Die Hard | Action | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
Kat | 31 | 12/12/2012 | Platinum | Beauty and the Beast | Family | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
154. Thinking about Object/Tables
1. To make this into objects we need to do some separation:
[Entity model diagram:
User (User_id, Age, St_dt, Category)
Movie (Movie_id, Title, Category)
Watch_session (Watch_id, St_dt, En_dt, User_id, Movie_id)
Watch_Events (Watch_id, St_dt, Type, Duration)
Category_Typ (Category_id, Stream_rt, Is_feature_enabled)
User and Movie each relate 1-to-many to Watch_session, and Watch_session relates 1-to-many to Watch_Events.]
155. Query Considerations
- Data is normally big, so:
- Partition according to access patterns
- Join with care
- Consider sampling or local testing before experimenting
- Data is files: latency to accessibility is high – seconds, minutes, or more
156. Look for big tables
[The same entity model as slide 154. The many-side tables (Watch_session and Watch_Events) are the ones that get big.]
159. View Strategies
Models: Hive relational model, Hive nested model.
Views:
- Hive normal views: use only for views that filter records/columns or mark fields.
- Hive materialized tables: use in the cases where the view requires a join that is done through a shuffle.
161. Nested
▪ Less space than denormalization
▪ You still have tables, but the cost of joins is all but gone
▪ Also great for Cartesian joins: N + M instead of N x M
▪ Not really supported yet with Kudu, or with HBase through SQL
162. Nested Example
CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING) STORED AS PARQUET;

CREATE TABLE dim_phones (
  contact_id BIGINT,
  category STRING,
  international_code STRING,
  area_code STRING,
  exchange STRING,
  extension STRING,
  mobile BOOLEAN,
  carrier STRING,
  current BOOLEAN,
  service_start_date TIMESTAMP,
  service_end_date TIMESTAMP
)
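The nested equivalent would fold the phones dimension into the contacts table as an array of structs. A sketch of such a combined table via Spark SQL; this combined DDL (with a trimmed column list) is my illustration, not from the slide:

```scala
// Nested version of the two tables above: each contact row carries
// its phones inline, so reading a contact needs no join.
spark.sql("""
  CREATE TABLE contacts (
    id BIGINT,
    name STRING,
    address STRING,
    phones ARRAY<STRUCT<
      category: STRING,
      area_code: STRING,
      exchange: STRING,
      mobile: BOOLEAN,
      current: BOOLEAN
    >>
  )
  STORED AS PARQUET
""")
```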
164. De-normalized vs Nested
- Nested pros
- Co-location
- Faster to group by
- Faster to window
- Joins are free
- Less data, better compression
- Tables and columns can be read without penalty from ones not read
- Great for limiting the cost of Cartesian joins
- Nested cons
- Size limitation on the parent row
- Adding a child requires rewriting the whole parent record
184. Hash Map
- There is a key and a value
- It is really fast to grab a key/value
- It is really fast to add a key/value
- Iteration is also possible
[Diagram: a client reading from a table of keys A–G, each mapped to a value.]
185. Log with Compactions
- When new records come in they don’t rewrite the old; they compact in.

Base (Key | Time | Value):
A 1 101, B 1 101, C 1 101, D 1 101, E 1 101, F 1 101, G 1 101

Incoming updates:
A 2 102, D 2 102, F 2 102, F 3 103, H 3 103

After compaction (latest time wins per key):
A 2 102, B 1 101, C 1 101, D 2 102, E 1 101, F 3 103, G 1 101, H 3 103
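The merge rule is simple to state in code: group all segments by key and keep the record with the latest time. A plain-Scala sketch over a subset of the tables above:

```scala
final case class Rec(key: String, time: Long, value: Long)

// Compaction keeps, per key, the record with the latest time.
def compact(segments: Seq[Seq[Rec]]): Seq[Rec] =
  segments.flatten
    .groupBy(_.key)
    .values.map(_.maxBy(_.time))
    .toSeq.sortBy(_.key)

val base    = Seq(Rec("A", 1, 101), Rec("B", 1, 101), Rec("F", 1, 101))
val updates = Seq(Rec("A", 2, 102), Rec("F", 2, 102), Rec("F", 3, 103), Rec("H", 3, 103))

println(compact(Seq(base, updates)))
// List(Rec(A,2,102), Rec(B,1,101), Rec(F,3,103), Rec(H,3,103))
```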
186. Log with Compactions
- Write path:
- Get the location for the record (cached)
- Write first to the WAL
- Then to the memstore (sorting & batching)
- Flush to a new HFile
- Later, HFiles are compacted
[Diagram: Client, Master, and a RegionServer whose memstore and WAL flush to HFiles / new HFiles on HDFS.]
187. Ordered
- All records and columns are ordered
- Ordering allows for simpler indexing
- Ordering allows for simpler compactions
- We will also use this ordering for:
- Windowing
- Time series
- Local scanning
[Diagram: the same Client / Master / RegionServer / HFiles layout as the previous slide.]
189. So what about SQL
- Well, SQL could totally work:
- CQL for Cassandra
- Hive and Spark SQL on HBase
- Why is it not the best idea?
- These stores are built more for point lookups
- Scans are not as fast as Parquet
- However, mutability may be more important than speed
- Partitioning is not simple: it must be put into the key
191. HBase Model
[Diagram: Client, Master, Region Server 1, Region Server 2.]
- A region server owns a range of splits
- If Region Server 1 fails:
- The Master needs to figure that out
- The Master assigns a new region server to own the splits
- Region Server 2 has to get organized
- Region Server 2 is then ready to serve reads and writes
199. What do they share (CAP theorem)
- Consistency vs. eventual consistency
- “Cheating” the CAP theorem: Cassandra is a good model here, expanding the definition of failure with tunable consistency
- CAP still holds, but…
202. Lucene Indexing (Facets)
- Facets are a side effect of our wonderful inverted indexes
- They allow us to count all the documents that belong to given index terms, to produce:
- Grouped counts
- Charts and graphs (Kibana or Banana)
- People will also call this access pattern “cubing” a dataset
204. Lucene Indexing (Facets Example)
- Time series example:

Document ID | Hour of Day | User | State | Event
1 | 12 | 4201 | MD | click
2 | 12 | 4202 | VA | click
3 | 12 | 4203 | VA | click
4 | 1  | 4201 | MD | click
5 | 1  | 4202 | VA | view
6 | 2  | 4204 | CA | click
7 | 2  | 4205 | VA | view
8 | 2  | 4201 | MD | click
205. Lucene Indexing (Facets Example)
The documents (now including doc 9: hour 2, user 4204, CA, click) inverted into posting lists per field value:

Hour of Day: 12 → [1, 2, 3]; 1 → [4, 5]; 2 → [6, 7, 8, 9]
User: 4201 → [1, 4, 8]; 4202 → [2, 5]; 4203 → [3]; 4204 → [6, 9]; 4205 → [7]
State: MD → [1, 4, 8]; VA → [2, 3, 5, 7]; CA → [6, 9]
Event: click → [1, 2, 3, 4, 6, 8, 9]; view → [5, 7]
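Facet counting then reduces to intersecting posting lists; a plain-Scala sketch over the lists above, computing the State facet restricted to Hour of Day = 2 (this anticipates the walk shown on the next slides):

```scala
// Posting lists from the slide: field value -> document IDs.
val hourOfDay = Map(12 -> Set(1, 2, 3), 1 -> Set(4, 5), 2 -> Set(6, 7, 8, 9))
val state     = Map("MD" -> Set(1, 4, 8), "VA" -> Set(2, 3, 5, 7), "CA" -> Set(6, 9))

// Facet counts for State, restricted to documents with Hour of Day = 2:
val hour2 = hourOfDay(2)
val facet = state.map { case (s, docs) => s -> (docs & hour2).size }

println(facet) // Map(MD -> 1, VA -> 1, CA -> 2)
```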
208. Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the Hour of Day = 2 posting list [6, 7, 8, 9] lined up against the State posting lists MD [1, 4, 8], VA [2, 3, 5, 7], and CA [6, 9].]
209. Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: walking the Hour of Day = 2 list against the State lists; document 6 matches State CA, so the CA facet count is incremented: +1 CA.]
210. Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the walk continues; document 7 increments VA (+1 VA), document 8 increments MD (+1 MD), and document 9 increments CA again (+1 CA), giving facet counts of MD: 1, VA: 1, CA: 2 for hour 2.]
212. Writing Latency
- Lucene indexing is more expensive than NoSQL writes
- Think of it as micro-batching: larger batches ~= better throughput
- Compaction is also involved: deletes impact storage and performance until they are compacted away
217. BSP Bulk Synchronous Parallel
- Process every node atomically:
- The node gets all messages sent to it
- Nodes can mutate themselves and their edges
- Nodes can send messages to other nodes, but nothing is received yet
- BSP waits until all the node processing is done (a barrier)
- Then messages are sent to the right partitions
- Repeat
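A toy single-threaded superstep, just to show the two phases and the barrier; real BSP systems (Giraph, GraphX Pregel) partition and parallelize this:

```scala
final case class Vertex(id: Int, var value: Double, edges: Seq[Int])

def superstep(vertices: Map[Int, Vertex], inbox: Map[Int, Seq[Double]]): Map[Int, Seq[Double]] = {
  val outbox = scala.collection.mutable.Map.empty[Int, Seq[Double]].withDefaultValue(Seq.empty)
  // Phase 1: every vertex processes its messages and mutates itself;
  // messages it emits are buffered, not delivered.
  for (v <- vertices.values) {
    val msgs = inbox.getOrElse(v.id, Seq.empty)
    if (msgs.nonEmpty) v.value = msgs.sum / msgs.size
    for (dst <- v.edges) outbox(dst) = outbox(dst) :+ v.value
  }
  // Phase 2 (the barrier): only after all vertices finish are the
  // buffered messages handed over as the next superstep's inbox.
  outbox.toMap
}
```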
220. Kudu
1. Replace the region servers with tablet servers
2. Replace the HFile block format with Parquet-like TFiles
3. Replace the byte-array-focused HBase API with one that is more JDBC-friendly
4. Tight integration with Spark SQL and Impala for SQL
5. Completely rewrite the compaction process to make for perfectly sized files, without major compactions but with always-manageable micro-compactions
225. Druid.IO
[Architecture diagram: clients send queries to a Broker cluster (broker nodes handle query planning and response preparation, and try to optimize the read path based on the time range of the query). Streaming ingestion – the main ingestion path – lands on the Real-Time cluster (real-time nodes keep a short-TTL hot memory cache), while batch ingestion (async large batching) flows through pluggable storage into the History cluster (history nodes). ZooKeeper and the metadata storage provide the housekeeping services.]
229. Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic jobs that do something
- e.g., convert data to a nested structure to reduce the number of shuffles
▪ For example:
- Kudu -> HDFS nested conversion is batch processing
- KMeans calculation, etc.
241. REST Servers
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
…

// Embedded Jetty server hosting a Jersey (JAX-RS) servlet.
val server = new Server(port)
val sh = new ServletHolder(classOf[ServletContainer])
// Tell Jersey to discover resource classes by scanning this package.
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
  "com.sun.jersey.api.core.PackagesResourceConfig")
sh.setInitParameter("com.sun.jersey.config.property.packages",
  "com.hadooparchitecturebook.taxi360.server.hbase")
// Enable automatic JSON (de)serialization of POJOs.
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true")
// Mount the servlet at the root path and block until shutdown.
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*")
server.start()
server.join()
247. SQL engine criteria
▪ Low latency SQL access
▪ Allows for high concurrency
▪ JDBC/ODBC integration
▪ Capable of large-scale aggregation
▪ Optionally integrates with multiple storage systems for real-time updates to SQL tables