This document discusses how big data and machine learning can be used to gain insights from large datasets and answer complex questions. It describes challenges in working with big data like data cleaning, modeling large datasets, and limitations of traditional tools. It then introduces H2O as a platform for performing fast, distributed machine learning on big data through an in-memory key-value store, distributed fork/join framework, and APIs for math hacking and model building. H2O aims to allow users to manipulate big data interactively like small data through its distributed, parallel architecture.
This is a very² basic introduction to R.
The purpose of this presentation is to equip you with the fundamentals of using R to work with data frames, which you can easily obtain by importing data from a relational database table or a CSV/text file.
Big data analysis in Python @ PyCon.tw 2013, by Jimmy Lai
Big data analysis involves several processes: collection, storage, computation, analysis, and visualization. In these slides, the author demonstrates these processes by using Python tools to build a data product. The example is based on text analysis of an online forum.
We all know that MongoDB is one of the most flexible and feature-rich databases available. In this session we'll discuss how you can leverage this feature set and maintain high performance with your project's massive data sets and high loads. We'll cover how indexes can be designed to optimize the performance of MongoDB. We'll also discuss tips for diagnosing and fixing performance issues should they arise.
This talk provides an engineering perspective on privacy protection. The intended audience is architects, developers, data scientists, and engineering managers that build applications handling user data. We highlight topics that require attention at an early design stage, and go through pitfalls and potentially expensive architectural mistakes. We describe a number of technical patterns for complying with privacy regulations without sacrificing the ability to use data for product features. The content of the talk is based on real world experience from handling privacy protection in large scale data processing environments.
Mystery pictures taken by the third-grade Techie Kids using digital cameras, no flash, and the macro setting. We discussed looking at things in a new or interesting way.
Student motivation, by: Haseen Ah-Hassan / Haseeb Ahmed
I'm a student in the English department at Zakho University. Like any other student, I gave my own presentation in ELT (English Language Teaching), on student motivation. When I got an excellent grade for it, I decided to share it with everyone.
Triangle Inequality Theorem: Activities and Assessment Methods, by Marianne McFadden
A comprehensive lesson on the Triangle Inequality Theorem, including pre-assessment, a hands-on activity (with rubric), and post-assessment methods that measure varying levels of achievement.
Game On: Everything you need to know about how games are changing the world, by Jeremy Johnson
Gaming is at a tipping point; never before have games affected our day-to-day lives in such a substantial way. From entertaining yourself on the subway with Angry Birds to solving the world's greatest problems, gaming is quickly becoming a mainstream way to explore, communicate, connect, and work.
With "Game On," Jeremy Johnson will take you on a tour of gaming trends, including everyone's favorite gaming buzzwords: gamification, gameful, game layer, gamestorming, game mechanics, gameplay, game theory, and good old video games. How's that for an extra helping of games? Let's top it off with a Call of Duty deathmatch. Who's game?
This presentation was given at Big Design 2011 in Dallas, Texas. #bigd11
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Monitoring Big Data Systems - "The Simple Way", by Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools themselves, what should you monitor about the actual data that flows through the system?
We'll also cover the simplest solution using your day-to-day open source tools; the surprising thing is that it comes not from an ops guy.
Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience building various systems, both near-real-time applications and Big Data distributed systems.
He describes himself as a software development groupie, interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned, by Omid Vahdaty
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of the cloud. I am using AWS, GCP, and data-center infrastructure to answer the basic questions of anyone starting out in the big data world.
How do you transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, or Avro? Which technology should we use to model the data: EMR? Athena? Redshift? Spectrum? Glue? Spark? Spark SQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow? How do you handle streaming? How do you manage costs? Performance tips? Security tips? Cloud best-practice tips?
In this meetup we present lecturers working on several cloud vendors, various big data platforms such as Hadoop and data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendors):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
YouTube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Big Data in 200 km/h | AWS Big Data Demystified #1.3, by Omid Vahdaty
Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Dirty data? Clean it up! - Datapalooza Denver 2016, by Dan Lynn
Dan Lynn (AgilData) & Patrick Russell (Craftsy) present on how to do data science in the real world. We discuss data cleansing, ETL, pipelines, hosting, and share several tools used in the industry.
DSD-INT 2017 The use of big data for dredging - De Boer (Deltares)
Presentation by Gerben de Boer (van Oord) at the Symposium Earth Observation and Data Science, during Delft Software Days - Edition 2017. Thursday, 2 November 2017, Delft.
Garbage in, garbage out - we have all heard about the importance of data quality. Having high quality data is essential for all types of use cases, whether it is reporting, anomaly detection, or for avoiding bias in machine learning applications. But where does high quality data come from? How can one assess data quality, improve quality if necessary, and prevent bad quality from slipping in? Obtaining good data quality involves several engineering challenges. In this presentation, we will go through tools and strategies that help us measure, monitor, and improve data quality. We will enumerate factors that can cause data collection and data processing to cause data quality issues, and we will show how to use engineering to detect and mitigate data quality problems.
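As an illustration of the kind of engineering checks described above, here is a minimal Python sketch of a data-quality report covering missing values, out-of-range values, and duplicate records. The field names and valid ranges are hypothetical, not taken from the talk.

```python
# Illustrative data-quality checks: missing required fields, out-of-range
# numeric values, and exact duplicate records in a list of record dicts.
from collections import Counter

def quality_report(rows, required_fields, valid_ranges):
    """Count common data-quality problems in a list of record dicts."""
    report = Counter()
    seen = set()
    for row in rows:
        # Missing or null required fields
        for field in required_fields:
            if row.get(field) in (None, ""):
                report["missing:" + field] += 1
        # Values outside an expected numeric range
        for field, (lo, hi) in valid_ranges.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not lo <= value <= hi:
                report["out_of_range:" + field] += 1
        # Exact duplicate records
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate_row"] += 1
        seen.add(key)
    return dict(report)

rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": -5},     # out of range
    {"user_id": None, "age": 28},  # missing id
    {"user_id": 1, "age": 34},     # duplicate
]
print(quality_report(rows, ["user_id"], {"age": (0, 120)}))
```

Checks like these can run continuously on incoming batches, so quality problems are detected before they reach downstream consumers.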
Building LLM Solutions using Open Source and Closed Source Solutions in Coher..., by Sri Ambati
Sandeep Singh, Head of Applied AI Computer Vision, Beans.ai
H2O Open Source GenAI World SF 2023
In the modern era of machine learning, leveraging both open-source and closed-source solutions has become paramount for achieving cutting-edge results. This talk delves into the intricacies of seamlessly integrating open-source Large Language Model (LLM) solutions like Vicuna, Falcon, and Llama with industry giants such as ChatGPT and Google's Palm. As the demand for fine-tuned and specialized datasets grows, it is imperative to understand the synergy between these tools. Attendees will gain insights into best practices for building and enriching datasets tailored for fine-tuning tasks, ensuring that their LLM projects are both robust and efficient. Through real-world examples and hands-on demonstrations, this talk will equip attendees with the knowledge to harness the power of both open and closed-source tools in a coherent and effective manner.
Patrick Hall, Professor, AI Risk Management, The George Washington University
H2O Open Source GenAI World SF 2023
Language models are incredible engineering breakthroughs but require auditing and risk management before productization. These systems raise concerns about toxicity, transparency and reproducibility, intellectual property licensing and ownership, disinformation and misinformation, supply chains, and more. How can your organization leverage these new tools without taking on undue or unknown risks? While language models and associated risk management are in their infancy, a small number of best practices in governance and risk are starting to emerge. If you have a language model use case in mind, want to understand your risks, and do something about them, this presentation is for you!
Dr. Alexy Khrabrov, Open Source Science Community Director, IBM
H2O Open Source GenAI World SF 2023
In this talk, Dr. Alexy Khrabrov, recently elected Chair of the new Generative AI Commons at Linux Foundation for AI & Data, outlines the OSS AI landscape, challenges, and opportunities. With new models and frameworks being unveiled weekly, one thing remains constant: community building and validation of all aspects of AI is key to reliable and responsible AI we can use for business and society needs. Industrial AI is one key area where such community validation can prove invaluable.
Michelle Tanco, Head of Product, H2O.ai
H2O Open Source GenAI World SF 2023
Learn how the makers at H2O.ai are building internal tools to solve real use cases using H2O Wave and h2oGPT. We will walk through an end-to-end use case and discuss how to incorporate business rules and generated content to rapidly develop custom AI apps using only Python APIs.
Applied Gen AI for the Finance Vertical, by Sri Ambati
Megan Kurka, Vice President, Customer Data Scientist, H2O.ai
H2O Open Source GenAI World SF 2023
Discover the transformative power of Applied Gen AI. Learn how the H2O team builds customized applications and workflows that integrate capabilities of Gen AI and AutoML specifically designed to address and enhance financial use cases. Explore real world examples, learn best practices, and witness firsthand how our innovative solutions are reshaping the landscape of finance technology.
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren..., by Sri Ambati
Pascal Pfeiffer, Principal Data Scientist, H2O.ai
H2O Open Source GenAI World SF 2023
This talk dives into the expansive ecosystem of Large Language Models (LLMs), offering practitioners an insightful guide to various relevant applications, from natural language understanding to creative content generation. While exploring use cases across different industries, it also honestly addresses the current limitations of LLMs and anticipates future advancements.
Introduction to Machine Learning with H2O-3 (1), by Sri Ambati
In this virtual meetup, we give an introduction to the #1 open-source machine learning platform, H2O-3, and show you how you can use it to develop models to solve different use cases.
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use..., by Sri Ambati
Numerai is an open, crowd-sourced hedge fund powered by predictions from data scientists around the world. In return, participants are rewarded with weekly payouts in crypto.
In this talk, Joe will give an overview of the Numerai tournament based on his own experience. He will then explain how he automates the time-consuming tasks such as testing different modelling strategies, scoring new datasets, submitting predictions to Numerai as well as monitoring model performance with H2O Driverless AI and R.
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo..., by Sri Ambati
In this session, you will learn about what you should do after you’ve taken an AI transformation baseline. Over the span of this session, we will discuss the next steps in moving toward AI readiness through alignment of talent and tools to drive successful adoption and continuous use within an organization.
To find additional videos on AI courses and earn badges, join the courses at the H2O.ai Learning Center: https://training.h2o.ai/products/ai-foundations-course
To find the YouTube video about this presentation: https://youtu.be/K1Cl3x3rd8g
Speaker:
Chemere Davis (H2O.ai - Senior Data Scientist Training Specialist)
Accelerate your Kubernetes clusters with Varnish Caching, by Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GraphRAG is All You Need? LLM & Knowledge Graph, by Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Essentials of Automations: Optimizing FME Workflows with Parameters, by Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Transcript: Selling digital books in 2024: Insights from industry leaders - T..., by BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
DevOps and Testing slides at DASA Connect, by Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 3, by DianaGray10
Welcome to part 3 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe, by Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share the foundational concepts to build on.
The Art of the Pitch: WordPress Relationships and Sales, by Laura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients' needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova..., by Ramesh Iyer
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes a lot of work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The New Frontiers of AI in RPA with UiPath Autopilot™, by UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of automations.
📕 Together we will look at some examples of how Autopilot is used in several tools of the UiPath suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Welcome to ViralQR, your best QR code generator, by ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, enhancing customer interaction and making business more fluid. We strongly believe in the ability of QR codes to change how businesses interact with their customers, and we are set on making that technology accessible and usable far and wide.
Our Achievements
Since our inception, we have successfully served many clients, providing QR codes for marketing, service delivery, and feedback collection across various industries. Our platform has been recognized for its ease of use and powerful features, which help businesses make QR codes.
Our Services
At ViralQR, we offer a comprehensive suite of services that caters to your every need:
Static QR codes: Create free static QR codes. These QR codes can store information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, there is a 14-day free trial of ViralQR, an excellent opportunity for new users to get a feel for the platform. One can easily subscribe from there and experience the full power of dynamic QR codes. The subscription plans are priced flexibly so that practically every business can afford to benefit from our service.
Why choose us?
ViralQR provides services for marketing, advertising, catering, retail, and the like. QR codes can be placed on flyers, packaging, merchandise, and banners, and can even substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, you can improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools that give a clear view of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we offer nothing but the best in QR code services to meet the diverse needs of your business!
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Sv big datascience_cliffclick_5_2_2013
1. Big Data for Big Questions
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
2. ● Motivation: What & Why Big Math?
● Better Mousetrap
● Demo
● Fork: Deep Dive into Math Hacking ...or... K/V Store
Source: https://github.com/0xdata/h2o
9.
42!
What was the question again?
Oh yeah, it was:
● How do I place ads based on a clickstream?
● Detect fraud in a credit-card swipe stream?
● Detect cancer from sensor data?
● Predict equipment failure ahead of time?
● Find people (un)like me?
● ... or ... or ... or... ????
10.
How do I figure it all out?
● Well... what are my tools?
● Domain Knowledge,
● (me! The Expert)
● Math & Science! Data Science, and
● Data – lots and lots and lots of it
● Old logs, new logs, databases, historical records, click-streams, CSV files, dumps
● Often TB's, sometimes PB's of it
11.
Data: The Main Player
● Data: I got lots of it
● But it's a messy mixed-up lot
● Stored in HDFS, S3, DB2 or scattered about
● Incompatible formats, older & newer bits
● Missing stuff, or "known broken" fields
● And it's Big
● Too big for my laptop, or even one server
12.
Data: Cleaning it Up
● Just the parts I want:
● SQL, Hive, HBase, grep
● Data is Big, so this is slow
● Wrong format:
● Awk, shell scripts, files, disk-to-disk
● Inspection (do I got it right yet?)
● Grep/awk, histograms, plots/prints
● Visualization tools
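The inspection step above (histograms to check whether the data was parsed correctly) can be sketched in a few lines of Python; the CSV content here is invented for illustration.

```python
# Quick inspection sketch: a text histogram of one CSV column, in the
# spirit of the grep/awk/histogram checks above. The data is made up.
import csv
import io
from collections import Counter

raw = io.StringIO(
    "country,clicks\n"
    "US,3\n"
    "DE,1\n"
    "US,7\n"
    "FR,2\n"
)

# Count how often each value of the "country" column appears
counts = Counter(row["country"] for row in csv.DictReader(raw))
for country, n in counts.most_common():
    print(f"{country:3} {'#' * n}")
```

A glance at the bar lengths is often enough to spot a broken parser or a column full of garbage values.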
13.
From Facts to Knowledge
● Data cleaned up: lots of neat rows of facts
● Lots of rows: millions and billions ...
● But facts is not knowledge
● Too much to "get it" by looking
● Time for a mathematical Model!
● Here again, Big limits my tools
● Either can't deal, or deal very very slowly
14. 0xdata.com 14
Modeling: math(data)
● Modeling gives a simpler view
● A way to understand
● And predict in real time
● Modeling is Math!
● Generalized Linear Modeling
– Oldest, most well known & used
● Random Forest
● K-Means Clustering
15. 0xdata.com 15
Big Data vs Modeling
● Model: a concise description of my data
● A more accurate model predicts better
● Generally More Data builds a better Model
● But only if the tool can handle it
● (some datasets are not helped but it rarely hurts)
● Tools can't handle Big: so down-sample,
and use a better (more complex) algorithm
16. 0xdata.com 16
Big Data vs Better Algorithm
● Don't want to choose Big vs Better
● Down sampling loses information
● Want a way to manipulate Big Data like it's
small: interactive & fast. Subtle when I
need it and brute force when I don't
● Build the Better Algorithm and use Big Data
● Seeing 10x more data can yield prediction
increases, e.g. from 75% to 85%
17. 0xdata.com 17
Building The Better
Big Data Mousetrap
● Want fast: means DRAM instead of disk
● Fall back to disk, if data >>> DRAM
● Want fast: use all CPUs
● Problems are mostly data-parallel anyway
● Want ease-of-programming:
● “parallelism without effort”
● A well-understood programming model
18. 0xdata.com 18
Building The Better
Big Data Mousetrap
● Want ease-of-use:
● python, json, REST/HTML interfaces
● Full R semantics (via fastr project)
● Data ingest:
● where: HDFS, S3, NFS, URL, URI, browser
● what: csv, hive, rdata
19. 0xdata.com 19
Building The Better
Big Data Mousetrap
● Want ease-of-admin:
● e.g. java -jar h2o.jar
● auto-cluster (no config at all) or hadoop Job
● Want ease-of-upgrade:
adding more servers gives
● More CPU (faster exec)
● More DRAM (larger data in dram)
● More network/disk bandwidth (faster ingest)
20. 0xdata.com 20
H2O: An Engine for Big Math
● Built in layers – pick your abstraction level
● Analysts, starters: REST, browser
– "clicky clicky" load data, build model, score
● Scientists: R, JSON, python to drive engine
– Complex math
● Math hackers: building new algos
– Full (distributed) Java Memory Model
– "codes like Java, runs distributed"
● Core Engineering: call us, we're hiring
21. 0xdata.com 21
Core Engineering: K/V Store
● Classic distributed Key/Value store
● get/put/atomic-transaction
● Full JMM semantics, exact consistency
● Full caching as-needed
– Cached keys "get" in 150 nanos
– Misses limited by network speed
● Hardware-like cache coherency protocol
● Distributed fork/join (thanks Doug Lea)
22. 0xdata.com 22
Core Engineering: D/F/J
● Distributed fork/join (JSR 166y)
● Recursive-descent for data-parallel
● Distribution handled by the core
– Log-tree scatter/gather across cluster
● Supports map/reduce-style directly
● But also "do this on all nodes" style
● Or random graph hacking
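The recursive-descent style above can be sketched on a single node with the stock JSR-166y fork/join classes. This is an illustrative toy, not H2O's code: the class name, chunk size, and the sum operation are all made up here, and the log-tree scatter/gather that distributes work across cluster nodes is omitted — only the split/map/reduce shape is shown.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch of the recursive-descent, data-parallel style: split the index
// range in half until chunks are small, "map" over each leaf chunk, then
// "reduce" sibling results on the way back up the tree.
public class SumTask extends RecursiveTask<Double> {
    static final int CHUNK = 1 << 16;   // leaf size (H2O works on ~1MB chunks)
    final double[] data;
    final int lo, hi;

    public SumTask(double[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Double compute() {
        if (hi - lo <= CHUNK) {          // leaf: the "map" over one chunk
            double sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;       // recursive descent: split in half
        SumTask left  = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                     // run the left half in parallel
        return right.compute() + left.join();  // the "reduce"
    }

    public static double sum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```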
23. 0xdata.com 23
Math Hacking
● “Tastes like (distributed) java”
(actual inner loop, auto-parallel, auto-distributed)
● Big “vector math” is easy
● The obvious for-loop "just works"
for( int i=0; i<rows; i++ ) {
double X = ary.datad(bits,i,A);
double Y = ary.datad(bits,i,B);
_sumX += X;
_sumY += Y;
_sumX2+= X*X;
}
24. 0xdata.com 24
Math Hacking
● Dense-vector algorithms are easy
● Generalized Linear Modeling: 2 weeks
● K-means: 2 days
● Histogram: 2 hours
● Random Forest: not dense vectors
● Still makes good use of D/F/J
● All-CPUs, all-nodes still light up
– Very fast tree building
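To see why dense-vector algorithms come together so quickly, consider one Lloyd iteration of k-means: it is just a for-loop over the rows accumulating per-cluster sums and counts — exactly the shape that maps over chunks and reduces by adding accumulators. The 1-D sketch below is illustrative only (class and method names are made up; H2O's implementation differs):

```java
// Minimal single-node sketch of one k-means (Lloyd) step over a 1-D column:
// assign each point to its nearest center, accumulate sums and counts,
// then recompute each center as the mean of its assigned points.
public class KMeans1D {
    public static double[] step(double[] xs, double[] centers) {
        int k = centers.length;
        double[] sum = new double[k];   // per-cluster accumulators: these are
        long[]   cnt = new long[k];     // exactly what a chunk-wise map emits
        for (double x : xs) {           // the data-parallel "map" over rows
            int best = 0;
            for (int c = 1; c < k; c++)
                if (Math.abs(x - centers[c]) < Math.abs(x - centers[best])) best = c;
            sum[best] += x;
            cnt[best]++;
        }
        double[] next = centers.clone(); // the "reduce" + center update
        for (int c = 0; c < k; c++)
            if (cnt[c] > 0) next[c] = sum[c] / cnt[c];
        return next;
    }
}
```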
25. 0xdata.com 25
Science: dancing with the data
● Like the belle of the ball, the main algos
(GLM, k-means, RF) only arrive when the
data is properly dressed
● Munging data: dropping junk columns,
replacing missing bits, adding features
● H2O provides a tool-kit
● Big vector calculator: "d := a+b*c"
● dram speeds: "msec per Gbyte"
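The "big vector calculator" idea is that an elementwise expression like d := a+b*c boils down to one data-parallel loop over the rows. A hedged sketch, with `IntStream.parallel()` standing in for H2O's chunked, distributed execution (class and method names here are invented for illustration):

```java
import java.util.stream.IntStream;

// Sketch of the vector-calculator expression d := a + b*c as a single
// data-parallel loop; each index writes a distinct slot, so the parallel
// forEach is race-free.
public class VecCalc {
    public static double[] calc(double[] a, double[] b, double[] c) {
        double[] d = new double[a.length];
        IntStream.range(0, a.length).parallel()
                 .forEach(i -> d[i] = a[i] + b[i] * c[i]);
        return d;
    }
}
```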
26. 0xdata.com 26
Science: APIs
● Need to script, automate repetitive tasks
● R via fastr and bigmemory package
● Full R semantics, 5x R speed single-thread
● But your vectors can be very very big...
● https://github.com/allr/fastr
● REST / URL / JSON
● Drive from e.g. python, scripts, curl, wget
– e.g. h2o testing harness is all python
27. 0xdata.com 27
Demos & Quick Starts
● Full browser interface
● Tutorials
● Handful of clicks to run e.g. RF or GLM
on gigabytes of data
● Auto-cluster in seconds
● On EC2 (or your laptops right now)
● Good enough for serious work
● (and have customers using this interface!)
29. 0xdata.com 29
H2O: An Engine for Big Math
● Focus on Big Math
● Easy to extend via M/R or K/V programming
● Auto-cluster
● Data-parallel exec across all CPUs
● dram caching across all servers
● Parallel ingest across all servers
● Open source: https://github.com/0xdata/h2o
30. 0xdata.com 30
Math Hacking: The M/R API
● Make a 'golden object'
● Will be endlessly replicated across cluster
● Set 'input' fields:
– Auto-serialized, distributed
– Shallow-copy on nodes: e.g. arrays share state
● golden.map(key_1mb)
● map() called on clone for each 1mb
● Set 'output' fields now
31. 0xdata.com 31
Math Hacking: The M/R API
● gold.reduce(gold)
● Combine pairs of 'golden' objects
● Both locally and remotely (distributed)
● Log-tree roll-up
● 'output' fields will be shipped over the wire
● null-out 'input' fields
● transient marker available
32. 0xdata.com 32
Math Hacking: Example
CalcSumsTask cst = new CalcSumsTask();
cst._arykey = ary._key; // BigData Table key
cst._colA = colA; // integer indices to columns
cst._colB = colB;
cst.invoke(ary._key); // Do It!
// Results returned directly in 'cst' object
...cst._sumX... // use results
public static class CalcSumsTask extends MRTask {
Key _arykey; // BigData Table key
int _colA, _colB; // Column indices to work on
double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
33. 0xdata.com 33
Math Hacking: Example
public static class CalcSumsTask extends MRTask {
Key _arykey; // BigData Table key
int _colA, _colB; // Column indices to work on
double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
// map called for every 1Mb of data, or so
public void map( Key key1Mb ) {
… boiler plate... // lots of unimportant details
// Standard for-loop over the data
for( int i=0; i<rows; i++ ) {
double X = ary.datad(bits,i,A);
double Y = ary.datad(bits,i,B);
_sumX += X;
_sumY += Y;
_sumX2+= X*X;
}
}
34. 0xdata.com 34
Math Hacking: Example
public static class CalcSumsTask extends MRTask {
Key _arykey; // BigData Table key
int _colA, _colB; // Column indices to work on
double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
// reduce called between pairs of golden objects
// always reduce right-side into 'this' object
public void reduce( DRemoteTask rt ) {
CalcSumsTask cst = (CalcSumsTask)rt;
_sumX += cst._sumX ;
_sumY += cst._sumY ;
_sumX2+= cst._sumX2;
}
}
35. 0xdata.com 35
A Fast K/V Store
● Distributed in-memory K/V Store
● Peer-to-peer, no master
● Full JMM semantics, get/put/atomic/remove
● Hardware-style cache-coherency protocol
● Fast: 150 nanos for a cache-hitting 'get'
● Fast: 50 micros for a cache-missing 'put'
● No persistence (see above for 'fast')
● No locks: use 'atomic' instead
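The "no locks, use 'atomic' instead" style means expressing an arbitrary read-modify-write on one key as a compare-and-swap retry loop. A sketch using a plain JDK `AtomicLong` rather than H2O's K/V API (the method name and the capped-add operation are invented for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lock-free single-key transaction: read the current value,
// compute the new one, and CAS it in; a racing writer makes the CAS fail,
// and we simply retry against the fresh value.
public class AtomicUpdate {
    public static long addCapped(AtomicLong v, long delta, long cap) {
        while (true) {
            long old = v.get();                      // read current value
            long nu  = Math.min(old + delta, cap);   // arbitrary transform
            if (v.compareAndSet(old, nu)) return nu; // retry if we raced
        }
    }
}
```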
36. 0xdata.com 36
K/V Design Goals
● JMM semantics on all get/put
● Cache-hitting 'gets' as fast as possible
● Local hashtable lookup + few tests
● 'puts' as lazy as possible (still JMM)
● Typically do not block for remote put
● Arbitrary transactions on single Keys
37. 0xdata.com 37
K/V Coherency Protocol
● Many are possible
● Picked a {fast-enough,easy} one
● Faster is possible
● Every Key has 1 master node
● And everybody knows it from Key hash
● Master orders racing writes
● Winner of the NBHM (non-blocking hashmap) insert
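"Everybody knows it from the Key hash" means the home node needs no lookup traffic: every node applies the same deterministic function of the key bytes and cluster size and gets the same answer. An illustrative sketch only — the hash function and names below are made up, and H2O's real placement logic differs in detail:

```java
// Sketch of deriving a key's master (home) node from its hash: any node
// can compute this locally, so routing a get/put needs no directory lookup.
public class KeyHome {
    public static int homeNode(byte[] key, int clusterSize) {
        int h = 17;
        for (byte b : key) h = 31 * h + b;    // simple deterministic hash
        return Math.floorMod(h, clusterSize); // same answer on every node
    }
}
```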
38. 0xdata.com 38
K/V Coherency Protocol
● Master tracks replicas
● Single CAS update
● Invalidate replicas on update
● Single CAS required, plus the invalidates
● Cache miss on replica will reload
● Interlocking get/put races solved with
finite state machine
41. 0xdata.com 41
The Expert
● Domain Expert:
● What data is useful, which is trash
● What needs help to become useful
● Missing elements? Toss outliers?
● Build new features from old?
● All through this process Big Data is, well,
Big, hence Slow to cp / awk / grep
● And Big limits my tools