Supporting running Spark scripts directly from a browser takes the user experience up a notch: everybody has a Web browser, the command line can be avoided, and built-in graphing and visualization make it easy to explore and understand data in just a few clicks. It also simplifies administration, since everything is centralized in one service and accessible to non-native clients. For this purpose, an open source Spark Job Server was developed in order to provide Scala, SQL and Python in a Web shell. The main Hadoop components of the platform are also integrated into the same interface. This talk describes the architecture of the Spark Server and its main features:
- Scala, Python, SQL submissions
- Impersonation
- Security
- Job progress / canceling
- YARN / HDFS / Hive integration
The server also ships with a friendly user interface built as a Hue app. We will focus on explaining how they were built, how to use the API and which lessons were learned. The final end-user interaction will be demoed live.
Hadoop Summit - Interactive Big Data Analysis with Solr, Spark and Hue (gethue)
Open up your user base to the data! Almost everybody knows how to search. This talk describes, through an interactive demo based on open source Hue, how users can graphically search their data in Hadoop with Apache Solr. The session will detail how to get started with data indexing in just a few clicks and then explore several data analysis scenarios. The open source Hue search dashboard builder, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns. Attendees of this talk will learn how to get started with interactive search visualization in their Hadoop cluster.
Interactively Search and Visualize Your Big Data (gethue)
Open up your user base to the data! Unlike programming and SQL, almost everybody knows how to search. This talk describes, through an interactive demo based on open source Hue, how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.
The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios. Through a web browser, attendees will be shown how to explore and visualize data for quick answers. The new search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.
Attendees of this talk will learn how to get started with interactive search visualization in their Hadoop cluster.
Apache Solr makes it easy to interactively visualize and explore your data. Create a dashboard, add some facets, select some values, cross them with time and just look at the results. Apache Spark is a fast-growing framework for streaming computations, which makes it ideal for real-time indexing. Solr also comes with the new Analytics Facets, a major weapon added to the arsenal of the data explorer. They bring another dimension: calculations. We can now do the equivalent of SQL, just in a much simpler and faster way. These calculations can operate over buckets of data.
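To make the idea of calculations over buckets concrete, here is a minimal SolrJ sketch in Scala issuing a JSON Facet query; the endpoint, collection and field names are hypothetical, but json.facet is Solr's real JSON Facet API parameter.

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object FacetSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Solr endpoint and collection.
    val client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()

    val query = new SolrQuery("*:*")
    // Bucket documents by day and compute an average inside each bucket:
    // the SQL-like "calculation over buckets" described above.
    query.setParam("json.facet",
      """{
        |  by_day: {
        |    type: range, field: created_at,
        |    start: "NOW-7DAY", end: "NOW", gap: "+1DAY",
        |    facet: { avg_amount: "avg(amount)" }
        |  }
        |}""".stripMargin)

    val response = client.query("transactions", query)
    println(response.getResponse.get("facets"))
    client.close()
  }
}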
SF Solr Meetup - Interactively Search and Visualize Your Big Data (gethue)
Open up your user base to the data! Unlike programming and SQL, almost everybody knows how to search. This talk describes, through an interactive demo based on open source Hue, how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.
The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios with the latest Solr Analytics Facets and Spark Streaming. Through a Web browser, attendees will be shown how to explore and visualize data for quick answers. The search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.
Attendees of this talk will learn how to get started with interactive search visualization in their Solr cluster.
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014 (gethue)
This talk describes how open source Hue was built in order to provide a better Hadoop User Experience. The underlying technical details of its architecture, the lessons learned, and how it integrates with Impala, Search and Spark under the covers will be explained.
The presentation continues with real life analytics business use cases. It will show how data can be easily imported into the cluster and then queried interactively with SQL or through a visual search dashboard. All through your Web Browser or your own custom Web application!
This talk is aimed at organizations trying to put a friendly “face” on Hadoop and get productive. Anybody looking to be more effective with Hadoop will also learn best practices and how to quickly get ramped up on the main data scenarios. Hue can be integrated with existing Hadoop deployments with minimal changes/disturbances. We cover details on how Hue interacts with the ecosystem and leverages the existing authentication and security model of your company.
To sum up, attendees of this talk will learn how Hadoop can be made more accessible and why Hue is the ideal gateway for using it more efficiently, or the starting point for your own Big Data Web application.
Faster Data Analytics with Apache Spark using Apache Solr (Chitturi Kiran)
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Spark SQL allows users to execute relational queries in Spark with distributed in-memory computations. Though Spark gives us faster in-memory computations, Solr is blazing fast for some analytic queries. In this talk, we will take a deep dive into how to optimize the SQL queries from Spark to Solr by plugging into the Spark LogicalPlanner using pushdown strategies. The key takeaways from the talk will be (a minimal usage sketch follows this list):
How to perform Spark SQL queries with Apache Solr?
What happens inside a Spark SQL query?
How to plug into Spark Logical Planner?
What type of push-down strategies are optimal with Solr?
Examples of push-down strategies
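A minimal usage sketch, assuming the open source spark-solr connector (its "solr" data source with zkhost/collection options); the collection and field names here are made up.

import org.apache.spark.sql.SparkSession

object SolrPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("solr-pushdown").getOrCreate()

    // Read a Solr collection as a DataFrame through the spark-solr data source.
    val events = spark.read.format("solr")
      .option("zkhost", "localhost:9983")  // hypothetical ZooKeeper address
      .option("collection", "events")      // hypothetical collection
      .load()

    events.createOrReplaceTempView("events")
    // The filter and projection below are pushdown candidates: instead of
    // shipping the whole collection into Spark, the planner can translate
    // them into Solr query parameters.
    spark.sql("SELECT status, count(*) FROM events WHERE status = 'ERROR' GROUP BY status").show()
  }
}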
Presented at Lucene Revolution - http://sched.co/BAwV
Ingesting and Manipulating Data with JavaScript (Lucidworks)
Data in the wild isn’t always in the right format for search, or even for mere usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format and make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Burn down the silos! Helping dev and ops gel on high availability websites (Lindsay Holmwood)
HA websites are where the rubber meets the road - at 200km/h. Traditional separation of dev and ops just doesn't cut it.
Everything is related to everything. Code relies on performant and resilient infrastructure, but highly performant infrastructure will only get a poorly written application so far. Worse still, root cause analysis in HA sites will more often than not identify problems that don't clearly belong to either devs or ops.
The two options are collaborate or die.
This talk will introduce 3 core principles for improving collaboration between operations and development teams: consistency, repeatability, and visibility. These principles will be investigated with real world case studies and associated technologies audience members can start using now. In particular, there will be a focus on:
- fast provisioning of test environments with configuration management
- reliable and repeatable automated deployments
- application and infrastructure visibility with statistics collection, logging, and visualisation
We describe the features of Oak Lucene indexes and how they can be used to make your queries perform better. In the second part we will talk about how asynchronous indexing works in general and how it can be monitored.
This was presented as part of the AEM Gem Series - http://dev.day.com/content/ddc/en/gems/oak-lucene-indexes.html
Parse is a suite of cloud-based APIs, services and libraries that focus on letting developers spend more time building out rich applications and less time dealing with the overhead of setting up and managing databases, push notifications, social sign-on, analytics, and even hosting and servers.
In this series I'll give an overview of the options for developing an application that leverages Parse, including using Cloud Code to deploy your Node.js app to Parse's own hosting service.
Spark Summit Europe: Building a REST Job Server for Interactive Spark as a Service (gethue)
Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end users manipulate them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts for its users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them. Livy also enables the development of Spark Notebook applications, which are ideal for quickly doing interactive Spark visualizations and collaboration from a Web browser! This talk is technical and details the architecture and design decisions taken for developing this server, as well as its internals. It also describes the alternatives we tried and the challenges that were faced. The capabilities of Livy will then be demoed live in Hue’s Notebook Application through a real life scenario.
https://spark-summit.org/eu-2015/events/building-a-rest-job-server-for-interactive-spark-as-a-service/
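To give a flavor of the API described above, here is a minimal Scala sketch of the REST interaction using the Java 11 HttpClient; Livy's /sessions endpoints and default port 8998 are as documented, while the host and the session id are assumptions made for the example.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.net.http.HttpRequest.BodyPublishers
import java.net.http.HttpResponse.BodyHandlers

object LivySketch {
  val livy = "http://localhost:8998"  // assumed local Livy server, default port
  val client = HttpClient.newHttpClient()

  def post(path: String, json: String): String = {
    val request = HttpRequest.newBuilder(URI.create(livy + path))
      .header("Content-Type", "application/json")
      .POST(BodyPublishers.ofString(json))
      .build()
    client.send(request, BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // 1. Create an interactive Scala session ("pyspark" and "sparkr" also work).
    println(post("/sessions", """{"kind": "spark"}"""))
    // 2. Once the session is idle, submit code to it; the result can then be
    //    polled as JSON at /sessions/0/statements/0.
    println(post("/sessions/0/statements", """{"code": "1 + 1"}"""))
  }
}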
Agenda:
• Brief overview of the Spark-provided spark-shell and spark-submit
• Overview of Spark Context
• Overview of Zeppelin and Jupyter notebooks for Spark
• Introduction to IBM Spark Kernel
• Introduction to Cloudera Livy and Spark JobServer
Github Link:
Previous meetups:
1) Introduction to Resilient Distributed Dataset and deep dive
Slides: http://www.slideshare.net/differentsachin/apache-spark-introduction-and-resilient-distributed-dataset-basics-and-deep-dive
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/225159947/
Video: https://www.youtube.com/watch?v=MkeRWyF1y_0
Github: https://github.com/SatyaNarayan1/spark_meetup
2) Introduction to Spark DataFrames/SQL and Deep dive
Slides: http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/226419828/
Video: https://www.youtube.com/watch?v=h71MNWRv99M
Github: https://github.com/parmarsachin/spark-dataframe-demo
3) Apache Spark - Introduction to Spark Streaming and Deep dive
Slides: http://www.slideshare.net/differentsachin/apache-spark-introduction-to-spark-streaming-and-deep-dive-57671774
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/227008581/
Video:
Github: https://github.com/agsachin/spark-meetup
Looking forward to a great interactive session. Do provide feedback.
Apache Spark Streaming: Architecture and Fault Tolerance (Sachin Aggarwal)
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Recent Developments In SparkR For Advanced Analytics (Databricks)
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
A Journey into Databricks' Pipelines: Journey and Lessons Learned (Databricks)
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a. Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
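This is not Databricks' actual pipeline code, just a minimal Structured Streaming sketch of the online-analysis pattern the abstract mentions; the input path and schema are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()
    import spark.implicits._

    // Treat a directory of JSON events as an unbounded table (path is made up).
    val events = spark.readStream
      .schema("ts TIMESTAMP, level STRING")
      .json("/data/incoming")

    // A streaming aggregation: error counts per minute, updated as data arrives.
    val counts = events
      .where($"level" === "ERROR")
      .groupBy(window($"ts", "1 minute"))
      .count()

    counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
  }
}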
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
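To illustrate why monoids promise parallelism, here is a tiny self-contained Scala sketch (not Scalding's or Algebird's actual Monoid trait, though it is close in spirit): word-count maps form a monoid, so chunks can be counted on different machines and merged in any order.

// A monoid: an associative combine operation with an identity element.
// Associativity is what licenses splitting work across machines and
// combining partial results in any grouping.
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object WordCountMonoid extends Monoid[Map[String, Int]] {
  def zero = Map.empty
  def plus(a: Map[String, Int], b: Map[String, Int]) =
    b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  def main(args: Array[String]): Unit = {
    // Count each chunk independently (this could happen on separate mappers)...
    val chunks = Seq("to be or", "not to be")
      .map(_.split(" ").groupBy(identity).map { case (w, ws) => w -> ws.length })
    // ...then reduce; the result is the same no matter how the reduce is grouped.
    println(chunks.reduce(plus))  // Map(to -> 2, be -> 2, or -> 1, not -> 1)
  }
}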
Introduction to Apache Flink - Fast and reliable big data processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Big Data Day LA 2016 / Hadoop / Spark / Kafka track - Iterative Spark Developmen... (Data Con LA)
This presentation will explore how Bloomberg uses Spark, with its formidable computational model for distributed, high-performance analytics, to take this process to the next level, and look into one of the innovative practices the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” - Vad... (Provectus)
In this demo-based talk with live coding, we’ll present a functional, typeful framework for developing Apache Spark applications. We’ll walk through the following key topics:
- turning unmanageable Spark scripts into typeful Spark functions
- serverless deployment of Spark functions into the cloud
- unit testing Spark functions to save cluster resources and developer time
- seamless Spark session management between concurrent Spark jobs in exclusive or shared modes
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses with real-time faculty in Bangalore.
These are the slides for what I shared at the JS Group meetup, 2014, Taiwan. They cover what JavaScript can do to make programs more "functional": the benefits, the price and the limitations.
You may all know that JSON is a subset of JavaScript, but… Did you know that HTML5 implements NoSQL databases? Did you know that JavaScript was recommended for REST by HTTP co-creator Roy T. Fielding himself? Did you know that map & reduce are part of the native JavaScript API? Did you know that most NoSQL solutions integrate a JavaScript engine? CouchDB, MongoDB, WakandaDB, ArangoDB, OrientDB, Riak…. And when they don’t, they have a shell client which does. The story of NoSQL and JavaScript goes beyond your expectations and opens more opportunities than you might imagine… What better match could you find than a flexible and dynamic language for schemaless databases? Isn’t an event-driven language what you’ve been waiting for to manage consistency? When NoSQL doesn’t come to JavaScript, JavaScript comes to NoSQL. And does it very well.
Testing batch and streaming Spark applications (Łukasz Gawron)
Apache Spark is a general engine for processing data on a large scale. Employing this tool in a distributed environment to process large data sets is undeniably beneficial.
But what about fast feedback loop while developing such application with Apache Spark? Testing it on a cluster is essential, but it does not seem to be what most developers accustomed to TDD workflow would like to do.
In the talk, Łukasz will share with you some tips on how to write the unit and integration tests, and how Docker can be applied to test a Spark application on a local machine.
Examples will be presented within the ScalaTest framework, and it should be easy to grasp by people who know Scala and other JVM languages.
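A minimal sketch of the unit-testing approach within ScalaTest, assuming a local-mode SparkSession; the transformation under test is hypothetical.

import org.apache.spark.sql.{Dataset, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

// Logic written against Datasets rather than a concrete cluster runs the
// same way locally and on YARN, which is what makes it unit-testable.
object Transform {
  def errorsOnly(lines: Dataset[String]): Dataset[String] =
    lines.filter(_.contains("ERROR"))
}

class TransformSpec extends AnyFunSuite {
  // local[2] gives a self-contained two-thread "cluster" for fast feedback.
  private val spark = SparkSession.builder()
    .master("local[2]").appName("transform-spec").getOrCreate()
  import spark.implicits._

  test("keeps only error lines") {
    val input = Seq("INFO ok", "ERROR boom").toDS()
    assert(Transform.errorsOnly(input).collect().toSeq == Seq("ERROR boom"))
  }
}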
Hadoop became the most common system for storing big data.
With Hadoop, many supporting systems emerged to fill in the aspects that are missing from Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
Since one presentation cannot cover too many of them, I tried to focus on the most famous/popular ones and on the most interesting ones.
Similar to Big Data Scala by the Bay: Interactive Spark in your Browser
Learn about the HBase browser in Hue, the UI for Apache Hadoop.
Presented by Abraham Elmahrek at Hadoop Israel www.meetup.com/HadoopIsrael/events/161701092/
Find out everything you need about Hue at http://gethue.com
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup (gethue)
This talk will describe how Hue can be integrated with existing Hadoop deployments with minimal changes/disturbances. Romain will cover details on how Hue can leverage the existing authentication system and security model of your company. He will also cover the Hive/Shark/Pig/Oozie best practice setup for Hue.
http://www.meetup.com/hadoop/events/125191612/
Learn about Hue, the UI for Apache Hadoop.
Presented by Enrico Berti at the Hadoop Singapore meetup.
Find out everything you need about Hue at http://gethue.com
Hue is an open source Hadoop Web UI that lets users be more productive, while also providing a framework for building new apps quickly. Get a tour of Hue features and learn how to re-use the APIs for submitting Hive queries, listing HDFS files, and submitting MapReduce jobs.
Hue: The Hadoop UI - Where we stand, Hue Meetup SF (gethue)
Learn about all the new features of Hue 3.5+ that are included in Cloudera CDH 5.
Presented in San Francisco @ Cloudera by Romain Rigaux and Abe Elmahrek
Learn about the HBase browser in Hue, the UI for Apache Hadoop.
Presented by Abraham Elmahrek at the LA HBase user meetup http://www.meetup.com/Los-Angeles-HBase-User-group/events/152073322/
Find out everything you need about Hue at http://gethue.com
Learn about Hue, the UI for Apache Hadoop.
Presented by Enrico Berti at the HUG France meetup http://hugfrance.fr/meetup-le-11-decembre-2013/
Find out everything you need about Hue at http://gethue.com
Learn about Hue, the UI for Apache Hadoop.
Presented by Enrico Berti at the HUG Stockholm meetup.
Find out everything you need about Hue at http://gethue.com
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
The Building Blocks of QuestDB, a Time Series Database (Javier Ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
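As a rough illustration of one of the ideas above, skipping computation on vertices that have already converged, here is a simplified plain-Scala sketch; it is not the STICD implementation, and freezing vertices one by one like this is a heuristic rather than an exact method.

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    // Tiny directed graph as adjacency lists: node -> out-neighbours (made up).
    val out = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0), 3 -> Seq(2))
    val in = (0 to 3).map(v => v -> out.collect { case (u, ns) if ns.contains(v) => u }.toSeq).toMap
    val n = 4; val d = 0.85; val tol = 1e-10

    var rank = Array.fill(n)(1.0 / n)
    val converged = Array.fill(n)(false)
    var changed = true
    while (changed) {
      changed = false
      val next = rank.clone()
      for (v <- 0 until n if !converged(v)) {  // skip already-converged vertices
        val incoming = in(v).map(u => rank(u) / out(u).size).sum
        next(v) = (1 - d) / n + d * incoming
        if (math.abs(next(v) - rank(v)) < tol) converged(v) = true else changed = true
      }
      rank = next
    }
    println(rank.toSeq)
  }
}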
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, need an efficient graph representation; Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
16. LIVY SPARK SERVER
• REST Web server in Scala
• Interactive Spark Sessions and Batch Jobs
• Type Introspection for Visualization
• Running sessions in YARN local
• Backends: Scala, Python, R
• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
17. LIVY WEB SERVER ARCHITECTURE
[Diagram: the Livy Server (Scalatra, Session Manager, Session) talks through a Spark Client to the YARN Master; YARN nodes host the Spark Interpreter with its Spark Context, alongside Spark Workers]
18. LIVY WEB SERVER ARCHITECTURE
[Same diagram, with step 1 of the request flow highlighted: a request arrives at the Livy Server]
27. INTERPRETERS
• Pipe stdin/stdout to a running shell
• Execute the code / send to Spark workers
• Perform magic operations
• One interpreter per language
• “Swappable” with other kernels (python, spark..)
Example exchange with the interpreter:
> println(1 + 1)
2
29. INTERPRETER FLOW CHART
[Flow chart: receive lines, split lines, then for each line check whether it is a magic line. Magic lines run the magic; regular lines are executed. On Success, continue with the lines left; on Incomplete, merge the line with the next one; on Error, stop. In all terminal cases, send the output to the server.]
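A simplified Scala sketch of the loop in the flow chart (the real implementation lives in the Livy sources linked on slide 16); the names here are illustrative.

object InterpreterLoop {
  sealed trait LineResult
  case class Success(output: String) extends LineResult
  case object Incomplete extends LineResult
  case class Error(output: String) extends LineResult

  // executeLine is assumed to handle both magic lines (%json, %table) and
  // regular code, as in the executeLine shown on slide 34.
  def run(lines: Seq[String], executeLine: String => LineResult): Either[String, String] = {
    var pending = ""
    val outputs = Seq.newBuilder[String]
    for (line <- lines) {
      val code = if (pending.isEmpty) line else pending + "\n" + line
      executeLine(code) match {
        case Success(out) => outputs += out; pending = ""  // success: next line
        case Incomplete   => pending = code                // incomplete: merge with next line
        case Error(out)   => return Left(out)              // error: stop execution
      }
    }
    Right(outputs.result().mkString("\n"))
  }
}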
30. LIVY INTERPRETERS
trait Interpreter {
  def state: State
  def execute(code: String): Future[JValue]
  def close(): Unit
}

sealed trait State
case class NotStarted() extends State
case class Starting() extends State
case class Idle() extends State
case class Running() extends State
case class Busy() extends State
case class Error() extends State
case class ShuttingDown() extends State
case class Dead() extends State
32. SPARK INTERPRETER
class SparkInterpreter extends Interpreter {
  …
  private var _state: State = NotStarted()
  private val outputStream = new ByteArrayOutputStream()
  private var sparkIMain: SparkIMain = _

  def start() = {
    ...
    _state = Starting()
    sparkIMain = new SparkIMain(new Settings(), new JPrintWriter(outputStream, true))
    sparkIMain.initializeSynchronous()
    ...
33. SPARK INTERPRETER
private var sparkContext: SparkContext = _

def start() = {
  ...
  val sparkConf = new SparkConf(true)
  sparkContext = new SparkContext(sparkConf)
  sparkIMain.beQuietDuring {
    sparkIMain.bind("sc", "org.apache.spark.SparkContext",
      sparkContext, List("""@transient"""))
  }
  _state = Idle()
}
34. EXECUTING SPARK
private def executeLine(code: String): ExecuteResult = {
  code match {
    case MAGIC_REGEX(magic, rest) =>
      executeMagic(magic, rest)
    case _ =>
      scala.Console.withOut(outputStream) {
        sparkIMain.interpret(code) match {
          case Results.Success => ExecuteComplete(readStdout())
          case Results.Incomplete => ExecuteIncomplete(readStdout())
          case Results.Error => ExecuteError(readStdout())
        }
  ...
35. INTERPRETER MAGIC
private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

private def executeMagic(magic: String, rest: String): ExecuteResponse = {
  magic match {
    case "json" => executeJsonMagic(rest)
    case "table" => executeTableMagic(rest)
    case _ => ExecuteError(f"Unknown magic command $magic")
  }
}
36. INTERPRETER MAGIC
private def executeJsonMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    case Some(value: RDD[_]) => ExecuteMagic(Extraction.decompose(Map(
      "application/json" -> value.asInstanceOf[RDD[_]].take(10))))
    case Some(value) => ExecuteMagic(Extraction.decompose(Map(
      "application/json" -> value)))
    case None => ExecuteError(f"Value $name does not exist")
  }
}
Why do we want to do this? Currently it’s difficult to visualize results from Spark. Spark has a great interactive tool called “spark-shell” that allows you to interact with large datasets on the command line. For example, here is a session where we are counting the words used by Shakespeare. Running this computation is easy, but spark-shell doesn’t provide any tools for visualizing the results.
One option is to save the output to a file, then use a tool like Hue to import it into a Hive table and visualize it. We are obviously big fans of Hue, but there are still too many steps to go through to get to this point. If we want to change the script, say to filter out words like “the” and “and”, we need to go back to the shell, rerun our code snippet, save it to a file, then reimport it into the UI. It’s a slow process.
Multiple languages
Inherit Hue’s sharing, export/import
Hello, I’m Erick Tryzelaar, and I’m going to talk about the Livy Spark Server, which is our backend for Hue’s Notebook application.
Livy is a REST web server that allows a tool like Hue to interactively execute Scala and Spark commands, just like spark-shell. It goes beyond spark-shell by adding type introspection, which allows a frontend like Hue to render results in interactive visualizations. Furthermore, it allows sessions to be run inside YARN to support horizontally scaling out to hundreds of active sessions. It also supports a Python and R backend. Finally, it’s fully open source, and currently being developed in Hue.
The Livy server is built upon Scalatra and Jetty. Creating a session is as simple as POSTing to a particular URL. Behind the scenes, Livy will communicate with the YARN master to allocate some nodes to launch the interactive sessions. This is all done asynchronously, as there’s no telling when there will be resources available to run the sessions. Once the nodes have been allocated, Livy will start an interpreter on one of the nodes, which takes care of creating the Spark Context that actually runs the Spark operations. After it’s set up, the session signals to the Livy Server that it’s ready for commands. At that point, the client can simply POST their code to a URL on the Livy server.
Let’s see it in action. On the left we see creating a “spark” session. You could also fill in “pyspark” and “sparkR” here if you want those sessions. On the right is us executing simple math in the session itself.
We don’t have too much time to drill down into the code, but we did want to take this moment to at least dive into how the interpreters work.
Livy’s interpreters are conceptually very simple devices. They take in one or more lines of code and execute them in a shell environment. These shells perform the computation and interact with the spark environment. They’re also abstract. As I mentioned earlier, Livy currently has 3 languages built into it: Scala, Python and R, with more to come.
Here is the interpreter loop that Livy manages. First, split up the lines and feed them one at a time into the interpreter. If the line is a regular, non-magic line, it gets executed and the result can be in one of three states: Success, where we’ll continue to execute the next line; Incomplete, where the input is not a complete statement, such as an “if” statement with an open bracket; or an Error, which stops the execution of these lines. The other case is magic lines, which are special commands to the interpreter itself, for example asking the interpreter to convert a value into a JSON type.
Now for some code. As we saw earlier, the interpreter is a simple state machine that executes code and eventually produces JSON responses by way of a Future.
In order to implement this interface, the Spark interpreter needs to first create the real interpreter, SparkIMain. It’s pretty simple to create. We just need to construct it with a buffer that acts as the interpreter’s standard output.
Once the SparkIMain has been initialized, we need to create the Spark Context that communicates with all of the spark workers. Injecting this variable into the interpreter is quite simple with this “bind” method.
Now that the session is up and running we can execute code inside of it. I’ve skipped some of the other bookkeeping in order to show the actual heart of the execution here. Ignore the magic case for the moment. Execution is also quite simple: we first temporarily replace standard out with our buffer, and then have the interpreter execute the code. There are three possible outcomes. First, the command executed. Second, the code is incomplete, maybe because it has an open parenthesis. Finally, an error if some exception occurred. Altogether quite simple, and it doesn’t require any changes to Spark to do this.
And now the magic. I mentioned earlier that Livy supports type introspection. The way it does it is through these in-band magic commands, which start with a percent sign. The Spark interpreter currently supports two magic commands, “json” and “table”. The “json” magic will convert any value into a JSON value, and “table” will convert any value into a table-ish object that’s used for our visualization.
Here is our json magic. it takes advantage of json4s’s Extraction.decompose to try to convert values. We special case RDDs since they can’t be directly transformed into json. Instead we just pull out the first 10 items so we can at least show something.
The table magic does something similar, but it’s a bit large to compress into slides. We’ll see its results next.
Finally, here it is in action. Here we’re taking our Shakespeare code from earlier. If we run this snippet inside Livy, it returns an output mimetype of application/json, with the results inlined without encoding in the output.
Fingers crossed for a lot of reasons, it’s master and the VM was broken till 4 AM.
Next: learn more