Keynote: Big Data Analytics Ecosystem Strategy (IBM Sverige)
This document discusses IBM's analytics portfolio and vision. It provides an overview of Big Data trends, how Watson has evolved to be faster and smaller, and the need for real-time analytics. It also discusses IBM's approach to Big Data challenges like volume, velocity, variety and veracity. The document outlines IBM's analytics platform capabilities including accelerators, information integration, governance, and Hadoop solutions. It highlights the evolution of IBM Netezza and DB2 in analytics and how IBM is committed to helping clients succeed with Big Data.
IBM Cloudant is a NoSQL Database-as-a-Service. Discover how you can outsource the data layer of your mobile or web application to Cloudant to gain high availability, scalability, and tools to take you to the next level.
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery (Chris Schalk)
This document introduces several new Google cloud technologies: Google Storage for storing data in Google's cloud, the Prediction API for machine learning and predictive analytics, and BigQuery for interactive analysis of large datasets. It provides overviews and examples of using each service, highlighting their capabilities for scalable data storage, predictive modeling, and fast querying of massive amounts of data.
Building a Real-Time Gaming Analytics Service with Apache Druid (Imply)
At GameAnalytics we receive and process real-time behavioural data from more than 100 million daily active users, helping thousands of game studios and developers understand user behaviour and improve their games. In this talk, you will learn how we migrated our legacy backend from an in-house streaming analytics service to Apache Druid, and the lessons learned along the way. By adopting Druid, we have been able to reduce development costs, increase the reliability of our systems, and implement new features that would not have been possible with our old stack. We will provide an overview of our approach to schema design, segment optimization, the creation of our query layer, caching, and datasource optimization, which can help you understand how to use Druid successfully as a key component of your data processing and reporting infrastructure.
Analyze and visualize non-relational data with DocumentDB + Power BI (Sriram Hariharan)
This session will show how to analyze and visualize non-relational data with DocumentDB + Power BI. We are in the midst of a paradigm shift in how we store and analyze data. Unstructured or flexible-schema data represents a large portion of the data within an organization, and everyone wants to turn it into meaningful business information. Unstructured data analytics does not need to be time-consuming and complex. Come learn how to analyze and visualize unstructured data in DocumentDB.
The document discusses building a CouchDB application to store human protein data, describing how each protein document would contain information like name, sequence, and other defining features extracted from public databases. It provides an example protein document to demonstrate the type of data that would be stored.
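The example document itself isn't reproduced in this summary; as a minimal sketch, assuming CouchDB's standard HTTP API and invented field values, such a protein record could be stored like this:

```python
import requests  # CouchDB speaks plain HTTP + JSON

COUCH = "http://localhost:5984"  # assumed local CouchDB instance

# Hypothetical protein document; field names mirror the abstract
# (name, sequence, and other defining features from public databases).
doc = {
    "_id": "P69905",  # e.g. a UniProt accession used as the document ID
    "name": "Hemoglobin subunit alpha",
    "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF...",  # truncated
    "organism": "Homo sapiens",
    "source_dbs": ["UniProt", "PDB"],
}

requests.put(f"{COUCH}/proteins")  # create the database (error check omitted)
resp = requests.put(f"{COUCH}/proteins/{doc['_id']}", json=doc)
print(resp.json())  # e.g. {'ok': True, 'id': 'P69905', 'rev': '1-...'}
```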
As Twitch grew, both the amount of data we received and the number of employees interested in that data grew rapidly. In order to continue empowering decision making as we scaled, we turned to Druid and Imply to provide self-service analytics to both our technical and non-technical staff, allowing them to drill into high-level metrics instead of reading generated reports.
In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session will explain our Druid architecture in detail, including:
-The end-to-end architecture deployed on Amazon that includes Kinesis, RDS, S3, Druid, Pivot and Tableau
-How the data is brought together to deliver a unified view of live customer engagement and historical trends
-Operational best practices we learnt scaling Druid
-An example walk through using the platform
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015 (NoSQLmatters)
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day? With over 2 billion ads served and 230 TB of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, "an open-source data store designed for real-time exploratory analytics on large data sets". We will explore Druid's architecture and notable concepts, how relevant they are for some use cases, and how it really performs.
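The 2015 talk predates Druid's SQL layer, but the kind of drill-down it describes is easy to sketch against a modern broker's SQL endpoint (datasource, columns, and broker address below are assumptions):

```python
import requests

DRUID_SQL = "http://druid-broker:8082/druid/v2/sql"  # assumed broker address

# Hypothetical drill-down: impressions for one client, split by country,
# over the last day -- the "dynamic filtering over multiple dimensions"
# the talk describes.
query = """
SELECT country, COUNT(*) AS impressions
FROM ad_events
WHERE client_id = 'acme' AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country
ORDER BY impressions DESC
LIMIT 10
"""

rows = requests.post(DRUID_SQL, json={"query": query}).json()
for row in rows:
    print(row["country"], row["impressions"])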
In this presentation we go through the differences and similarities between Redshift and BigQuery. It was presented at the Athens Big Data meetup in May 2017.
A short introduction to BigQuery. With this presentation you'll quickly discover:
How to load data into BigQuery (see the sketch below)
How to build dashboards using BigQuery
How to work with BigQuery
and, last but not least, some best practices
We hope you'll enjoy this presentation and that it will help you start exploring this wonderful solution. Don't hesitate to send us your feedback or questions.
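As a minimal sketch of the load and query bullets above, using the google-cloud-bigquery Python client (project, dataset, table, and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # credentials via GOOGLE_APPLICATION_CREDENTIALS

# Load a CSV from Cloud Storage into a table (placeholder names).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events.csv", "my_project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete

# Query it -- iterating the query job yields result rows.
for row in client.query("SELECT COUNT(*) AS n FROM my_dataset.events"):
    print(row.n)
```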
Google BigQuery for Everyday Developer (Márton Kodok)
IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
A. Every scientist who needs big data analytics to save millions of lives should have that power
Legacy systems don’t provide the power.
B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.
The Plan: have oversight over developments as they happen.
Goal: Store everything accessible by SQL immediately.
What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Fully-Managed by Google (US or EU zone)
Scales into Petabytes
Ridiculously fast
Decent pricing (queries: $5/TB, storage: $20/TB) *October 2016 pricing
100,000 rows/sec Streaming API (sketched below)
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
Familiar DB Structure (table, views, record, nested, JSON)
Convenience of SQL + Javascript UDF (User Defined Functions)
Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
Client libraries available in YFL (your favorite languages)
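For example, the streaming API from the list above can be exercised with today's Python client (table name is a placeholder; the 2016-era raw REST call differed in shape, not in idea):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stream rows into a table; they become queryable within seconds.
rows = [
    {"user_id": "u1", "event": "click", "ts": "2016-10-01T12:00:00"},
    {"user_id": "u2", "event": "view",  "ts": "2016-10-01T12:00:01"},
]
errors = client.insert_rows_json("my_project.my_dataset.events", rows)
assert errors == [], errors  # an empty list means every row was accepted
```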
Our benefits
no provisioning/deployment
no running out of resources
no more focus on large-scale execution plans
no need to re-implement tricky concepts (time windows / join streams)
pay only for the columns referenced in your queries (illustrated below)
run raw ad-hoc queries (whether from analysts, sales, or devs)
no more throwing away, expiring, or aggregating old data
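The pay-per-column point is easy to verify: a dry-run query reports the bytes that would be scanned, which is what BigQuery bills on (table name hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True)  # estimate, don't execute

job = client.query("SELECT user_id FROM my_dataset.events", job_config=config)
# Only the user_id column is scanned, so the estimate -- and the bill --
# shrinks even if the table has dozens of other columns.
print(f"{job.total_bytes_processed / 1e9:.2f} GB would be processed")
```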
BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.
Data persistence using PouchDB and CouchDB (Dimgba Kalu)
A presentation on data persistence using PouchDB and CouchDB. This is a basic way of building an offline data repository with efficient data synchronization.
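PouchDB itself is JavaScript; to keep these sketches in one language, here is the server side of the same idea, CouchDB's replication endpoint, which PouchDB's sync() speaks against (database names invented):

```python
import requests

COUCH = "http://localhost:5984"  # assumed CouchDB instance

# One-shot replication between two databases -- the same protocol
# PouchDB uses to reconcile an offline browser copy with the server.
resp = requests.post(
    f"{COUCH}/_replicate",
    json={
        "source": f"{COUCH}/app_local",
        "target": f"{COUCH}/app_central",
        "create_target": True,
    },
)
print(resp.json())  # includes doc read/write counts and any conflicts
```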
How TrafficGuard uses Druid to Fight Ad Fraud and Bots (Imply)
In this session, TrafficGuard’s Head of Data Science, Raigon Jolly, will discuss how TrafficGuard uses Druid and its partnership with Imply to:
- Provide granular reporting to clients in near-real time
- Monitor rules and concept drift
- Stay ahead of the moving target that is ad fraud
- Facilitate performance tuning and right-sizing infrastructure so our team can focus on innovation of our core product
Gian will offer his reflections on the Druid journey to date, plus describe his vision for what Druid will become. He will lay out the near-term Druid roadmap and take your questions.
Watch video: https://imply.io/virtual-druid-summit/apache-druid-vision-and-roadmap-gian-merlino
MoPub, a Twitter company, provides monetization solutions for mobile app publishers and developers around the globe. MoPub receives over 33 billion ad requests per day, generating over 200 TB of raw logs every day. We built MoPub Analytics as the analytics platform, using Druid + Imply, for our end users: publishers, demand-side partners, and internal users.
We will talk about the architecture of the analytics platform, our Druid cluster setup, hardware choices, monitoring, use cases, limiting factors, challenges with lookups and solutions we used.
Watch video: https://imply.io/virtual-druid-summit/analytics-over-terabytes-of-data-at-twitter-apache-druid
Complex realtime event analytics using BigQuery @Crunch Warmup (Márton Kodok)
Complex event analytics solutions require massive architecture and know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. In this presentation we will see how BigQuery solves our ultimate goal: store everything, accessible by SQL, immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, affiliate metrics.
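As one hedged sketch of the retention use case, assuming an invented events(user_id, ts) schema, a week-over-week retention query might look like this:

```python
from google.cloud import bigquery

# Week-over-week retention: of users active in week N, what fraction
# returned in week N+1? Schema is invented for illustration.
sql = """
WITH weekly AS (
  SELECT user_id, TIMESTAMP_TRUNC(ts, WEEK) AS week
  FROM my_dataset.events
  GROUP BY user_id, week
)
SELECT a.week,
       COUNT(DISTINCT b.user_id) / COUNT(DISTINCT a.user_id) AS retention
FROM weekly a
LEFT JOIN weekly b
  ON a.user_id = b.user_id
 AND b.week = TIMESTAMP_ADD(a.week, INTERVAL 7 DAY)
GROUP BY a.week
ORDER BY a.week
"""
for row in bigquery.Client().query(sql):
    print(row.week, f"{row.retention:.1%}")
```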
Vitalii Bondarenko - Scalable Business Analytics in a Cloud Big Data Cluster (Lviv Startup Club)
This document discusses different architectures and approaches for scalable business intelligence (BI) on cloud big data clusters. It covers OLAP approaches using relational databases, OLAP-on-Hadoop engines like Apache Kylin, big data compute engines like Hive LLAP, and column stores like Druid. It also discusses in-memory aggregations with MemSQL and the results of testing the different solutions. Traditional BI approaches on SQL servers are discussed alongside modern approaches using Hadoop, Spark, and cloud services from AWS, Azure, and Google Cloud.
The integration between the Spring Framework and MongoDB tends to be somewhat unknown. This presentation shows the different projects that compose the Spring ecosystem (Spring Data, Spring Boot, Spring IO, etc.) and how to move from pure Java projects to massive enterprise systems that require these pieces to work together.
Azure DocumentDB for Healthcare Integration (BizTalk360)
This document provides an overview of using Azure DocumentDB as an HL7 document repository for healthcare integration. It discusses DocumentDB features like JSON documents, indexing, CRUD operations, and querying. Example use cases for an HL7 document repository using DocumentDB are presented, including personal health records, document sharing, decision support, and patient demographics. The document concludes by previewing the design of an Azure API connector app for DocumentDB and a Logic App for HL7 FHIR.
At the Heart of the Elastic Stack Roadmap (Elasticsearch)
Discover the latest features through our demos and announcements: cross-cluster replication, Elasticsearch frozen indices, Kibana Spaces, and ever more data integrations in Beats and Logstash.
Seeing at the Speed of Thought: Empowering Others Through Data Exploration (Greg Goltsov)
This slide deck is about empowering others through data exploration. The presentation discusses removing barriers to data, making feedback fast, and removing yourself as a blocker for others. It emphasizes visualizing data pipelines and augmenting data warehouses with data lakes to handle varying data volumes, varieties, and velocities. The goal is to turn data into insights that create business value.
How to manage one million messages per second using Azure, Radu Vunvulea, ITD... (Radu Vunvulea)
At the beginning of a project it is easy to promise clients many things, but when you have to deliver on those promises you may discover it is impossible. Living in the IoT era, we need to be able to process large amounts of content per second. In this session we will see how we can construct a solution around Azure that can easily handle 1M messages per second. We will start the session with a real-time demo and then describe how such a system can be built from Azure services in less than 8 hours. Each Azure component used in the demo will be described in detail, along with its pros and cons.
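The abstract doesn't name the specific Azure services; assuming Azure Event Hubs as the ingestion tier, a minimal producer sketch with the azure-eventhub SDK could look like this (connection string and hub name are placeholders):

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders -- the talk builds a full pipeline, this is only ingestion.
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="telemetry",
)

with producer:
    batch = producer.create_batch()  # batching is what makes high rates feasible
    for i in range(500):
        batch.add(EventData(f'{{"device": {i}, "reading": 42}}'))
    producer.send_batch(batch)       # one network call for the whole batch
```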
How to get the best of both: MongoDB is great for low-latency, quick access to recent data; Treasure Data is great for an infinitely growing store of historical data. In the latter case, one need not worry about scaling.
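One common way to keep the MongoDB side holding only the recent, low-latency slice is a TTL index, with the full history flowing to the long-term store; a sketch with pymongo (collection and field names invented):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["app"]["recent_events"]

# TTL index: MongoDB expires documents ~30 days after their `ts` value,
# so the collection stays small and fast while the full history lives
# in the long-term store (Treasure Data, in the article's setup).
events.create_index("ts", expireAfterSeconds=30 * 24 * 3600)

events.insert_one({"user": "u1", "action": "login",
                   "ts": datetime.now(timezone.utc)})
```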
SQL Server Reporting Services Disaster Recovery Webinar (Denny Lee)
This is the PASS DW/BI Webinar for SQL Server Reporting Services (SSRS) Disaster Recovery webinar. You can find the video at: http://www.youtube.com/watch?v=gfT9ETyLRlA
Denny Lee introduced Azure DocumentDB, a fully managed NoSQL database service. DocumentDB provides elastic scaling of throughput and storage, global distribution with low latency reads and writes, and supports querying JSON documents with SQL and JavaScript. Common scenarios that benefit from DocumentDB include storing product catalogs, user profiles, sensor telemetry, and social graphs due to its ability to handle hierarchical and de-normalized data at massive scale.
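The "querying JSON documents with SQL" point looks roughly like this with the current azure-cosmos Python SDK, under which DocumentDB now ships as Cosmos DB (endpoint, key, and names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", "<key>")
container = client.get_database_client("catalog").get_container_client("products")

# SQL-like query over schema-free JSON documents.
items = container.query_items(
    query="SELECT c.id, c.price FROM c WHERE c.category = @cat",
    parameters=[{"name": "@cat", "value": "sensors"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"], item["price"])
```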
SQL Server Integration Services Best Practices (Denny Lee)
This is Thomas Kejser's and my presentation at the Microsoft Business Intelligence Conference 2008 (October 2008) on SQL Server Integration Services Best Practices.
This is an excerpt of the "Tier-1 BI in the World of Big Data" by Thomas Kejser, Denny Lee, and Kenneth Lieu specific to the Yahoo! TAO Case Study published at: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... (MongoDB)
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
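The talk's working example isn't reproduced here, but the shape of such a secondary index is small enough to sketch with pymongo, assuming documents that map a visitor key to where the raw clickstream lives on HDFS (all names hypothetical):

```python
from pymongo import ASCENDING, MongoClient

idx = MongoClient("mongodb://localhost:27017")["analytics"]["visitor_index"]

# Each document points from a low-latency lookup key to the raw data's
# location on high-latency HDFS (file + offset), as in the Omniture example.
idx.insert_one({
    "visitor_id": "v-123",
    "geo": {"country": "US", "city": "Portland"},
    "hdfs_path": "/data/omniture/2013/07/14/clicks_0042.tsv",
    "byte_offset": 1048576,
})
idx.create_index([("geo.country", ASCENDING), ("geo.city", ASCENDING)])

# Random access: find where a visitor's raw events live without scanning HDFS.
print(idx.find_one({"visitor_id": "v-123"})["hdfs_path"])
```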
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
Big Data Analytics in the Cloud with Microsoft Azure (Mark Kromer)
Big Data Analytics in the Cloud using Microsoft Azure services was discussed. Key points included:
1) Azure provides tools for collecting, processing, analyzing and visualizing big data including Azure Data Lake, HDInsight, Data Factory, Machine Learning, and Power BI. These services can be used to build solutions for common big data use cases and architectures.
2) U-SQL is a language for preparing, transforming and analyzing data that allows users to focus on the what rather than the how of problems. It uses SQL and C# and can operate on structured and unstructured data.
3) Visual Studio provides an integrated environment for authoring, debugging, and monitoring U-SQL scripts and jobs. This allows
Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. In this session we will learn how to create data integration solutions using the Data Factory service and ingest data from various data stores, transform/process the data, and publish the result data to the data stores.
QuerySurge Slide Deck for Big Data Testing Webinar (RTTS)
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why testing is pivotal to the success of your Big Data strategy.
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects,
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
Introduction to Microsoft Azure HDInsight by Dattatrey Sindhol (HARMAN Services)
This document provides an introduction to Microsoft Azure HDInsight, including:
- An overview of HDInsight and how it is Microsoft's Hadoop distribution running in the cloud based on Hortonworks Data Platform.
- The architecture of HDInsight and how it is tightly integrated with Microsoft's technology stack.
- Examples of use cases for HDInsight like iterative data exploration, data warehousing on demand, and ETL automation.
Best practices to deliver data analytics to the business with Power BI (Satya Shyam K Jayanty)
Bring your data to life with Power BI visualization and insights!
With the ever-changing landscape of Power BI features, it is essential to adopt configuration and deployment practices within your data platform that keep you on par with compliance & security requirements. In this session we will go from the basics to advanced tricks across this landscape:
How to deploy Power BI?
How to implement configuration parameters and package BI features as part of an Office 365 roll-out in your organisation?
What are the newest features and enhancements in the Power BI landscape?
How to manage on-premises vs. on-cloud connectivity?
How can you help and support the Power BI community as well?
Within the objectives of this session we will also look at cloud computing, which makes it possible to get data to the end user within a few clicks. We will review how to manage and connect on-premises data to cloud capabilities, taking full advantage of data catalogue features while keeping data secure per Information Governance standards. Beyond the nuts and bolts, performance is something every admin has to keep up with, so we will look at a few settings for maximizing performance and optimizing access to data. You will gain understanding of, and insight into, a number of tools available for your Business Intelligence needs, with a showcase of demos on where to begin and how to proceed in the BI world.
- D BI A Consulting
consulting@dbia.uk
This document provides an overview of how to build your own personalized search and discovery tool like Microsoft Delve by combining machine learning, big data, and SharePoint. It discusses the Office Graph and how signals across Office 365 are used to populate insights. It also covers big data concepts like Hadoop and machine learning algorithms. Finally, it proposes a high-level architectural concept for building a Delve-like tool using Azure SQL Database, Azure Storage, Azure Machine Learning, and presenting insights.
How to build your own Delve: combining machine learning, big data and SharePoint (Joris Poelmans)
You experience the benefits of machine learning every day through product recommendations on Amazon & Bol.com, credit card fraud prevention, etc. So how can we leverage machine learning together with SharePoint and Yammer? We will first look into the fundamentals of machine learning and big data solutions, and next we will explore how we can combine tools such as Windows Azure HDInsight, R, and Azure Machine Learning to extend and support collaboration and content management scenarios within your organization.
This document discusses using Azure HDInsight for big data applications. It provides an overview of HDInsight and describes how it can be used for various big data scenarios like modern data warehousing, advanced analytics, and IoT. It also discusses the architecture and components of HDInsight, how to create and manage HDInsight clusters, and how HDInsight integrates with other Azure services for big data and analytics workloads.
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C... (Imply)
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
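As a flavour of the "arbitrary granularities" feature mentioned above, Druid SQL lets each query pick its own time bucket at read time; a hedged sketch (datasource, columns, and broker address invented):

```python
import requests

DRUID_SQL = "http://druid-broker:8082/druid/v2/sql"  # assumed broker address

# Bucket sales events by a granularity chosen per query -- here
# 15-minute windows -- rather than a fixed, pre-aggregated rollup.
query = """
SELECT TIME_FLOOR(__time, 'PT15M') AS bucket, SUM(amount) AS sales
FROM store_sales
GROUP BY 1
ORDER BY 1
"""
for row in requests.post(DRUID_SQL, json={"query": query}).json():
    print(row["bucket"], row["sales"])
```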
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
Real-time big data analytics based on product recommendations case study (deep.bi)
We started as an ad network. The challenge was to recommend the best product (out of millions) to the right person in a given moment (thousands of users within a second). We have delivered 5 billion ad views over the past 24 months. To put that in context: if we served 1 ad per second, it would take about 160 years to deliver 5 billion ads.
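The 160-year figure is simple arithmetic:

```latex
\frac{5 \times 10^{9}\ \text{ads}}{1\ \text{ad/s}}
  = 5 \times 10^{9}\ \text{s}
  \approx \frac{5 \times 10^{9}}{3.156 \times 10^{7}\ \text{s/year}}
  \approx 158\ \text{years}
```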
So we needed a solution. SQL databases did not work. Popular NoSQL databases did not work. Standard data warehouse approaches (pre-aggregation, creating schemas) did not work either.
Re-thinking all the problems posed by the huge data streams flowing to us every second, we built a complete solution based on open-source technologies and fresh, smart ideas from our engineering team. It is called deep.bi, and now we make it available to other companies.
deep.bi lets high-growth companies solve fast data problems by providing scalable, flexible and real-time data collection, enrichment and analytics.
It was built using:
- Node.js - API
- Kafka - collecting and distributing data
- Spark Streaming - ETL, data enrichments
- Druid - real-time analytics
- Cassandra - user events store
- Hadoop + Parquet + Spark - raw data store + ad-hoc queries
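As a hedged sketch of the Kafka stage in the list above (topic name and event shape invented), the collection tier might push events like this with kafka-python:

```python
import json
from kafka import KafkaProducer

# Collecting stage of the pipeline: the API pushes raw events onto Kafka,
# from which Spark Streaming and Druid consume downstream.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

producer.send("user-events", {"user_id": "u42", "type": "ad_view",
                              "product": "p-981", "ts": 1496150400})
producer.flush()  # block until the event is actually delivered
```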
Big Data Analytics from Azure Cloud to Power BI Mobile (Roy Kim)
This document discusses using Azure services for big data analytics and data insights. It provides an overview of Azure services like Azure Batch, Azure Data Lake, Azure HDInsight and Power BI. It then describes a demo solution that uses these Azure services to analyze job posting data, including collecting data using a .NET application, storing in Azure Data Lake Store, processing with Azure Data Lake Analytics and Azure HDInsight, and visualizing results in Power BI. The presentation includes architecture diagrams and discusses implementation details.
This document discusses the emergence of logical data warehouses and how they can help organizations address challenges posed by big data. A logical data warehouse takes a virtualized approach to integrating data from multiple sources like relational databases, NoSQL stores, and file systems. It provides a single, unified view of data while keeping the underlying systems decoupled. The document also describes how organizations can use techniques like data virtualization and offloading to optimize workloads between their enterprise data warehouse and Hadoop data lake. This helps reduce costs while improving query performance and resource utilization.
This is a run-through at a 200 level of the Microsoft Azure Big Data Analytics for the Cloud data platform based on the Cortana Intelligence Suite offerings.
So you've got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle, and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, and storing it, to visualizing it, I will show you Microsoft's solutions for every step of the way.
DEVNET-1140 InterCloud MapReduce and Spark Workload Migration and Sharing: Fi... (Cisco DevNet)
Data gravity is a reality when dealing with massive amounts of data and globally distributed systems. Processing this data requires distributed analytics processing across the InterCloud. In this presentation we will share our real-world experience with storing, routing, and processing big data workloads on Cisco Cloud Services and Amazon Web Services clouds.
Testing Big Data: Automated Testing of Hadoop with QuerySurge (RTTS)
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Azure Cosmos DB: Globally Distributed Multi-Model Database Service (Denny Lee)
Azure Cosmos DB is the industry's first globally distributed multi-model database service. Features of Cosmos DB include turn-key global distribution, elastic throughput and storage, multiple consistency models, and financially backed SLAs. As well, we are in preview for Table, Graph, and Spark Connector to Cosmos DB. Also includes healthcare scenarios!
SQL Server Reporting Services: IT Best Practices (Denny Lee)
This is Lukasz Pawlowski's and my presentation at the Microsoft Business Intelligence Conference 2008 (October 2008) on SQL Server Reporting Services: IT Best Practices.
Introduction to Microsoft's Big Data Platform and Hadoop Primer (Denny Lee)
This is my 24 Hours of SQL PASS (September 2012) presentation on Introduction to Microsoft's Big Data Platform and Hadoop Primer, also known as Project Isotope and HDInsight.
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007) (Denny Lee)
This document discusses case studies using differential privacy to analyze sensitive data. It describes analyzing Windows Live user data to study web analytics and customer churn. Clinical researchers' perspectives on differential privacy were also examined. Researchers wanted unaffected statistics and the ability to access original data if needed. Future collaboration with OHSU aims to develop a healthcare template for applying differential privacy.
SQL Server Reporting Services Disaster Recovery webinar (Denny Lee)
This is the PASS DW|BI virtual chapter webinar on SQL Server Reporting Services Disaster Recovery with Ayad Shammout and myself - hosted by Julie Koesmarno (@mssqlgirl)
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D... (Denny Lee)
This document discusses lessons learned from deploying large scale SQL Server Reporting Services (SSRS) environments based on customer scenarios. It covers the key aspects of success, scaling out the architecture, performance optimization, and troubleshooting. Scaling out involves moving report catalogs to dedicated servers and using a scale out deployment architecture. Performance is optimized through configurations like disabling report history and tuning memory settings. Troubleshooting utilizes logs, monitoring, and diagnosing issues like out of memory errors.
Designing, Building, and Maintaining Large Cubes using Lessons Learned (Denny Lee)
This is Nicholas Dritsas, Eric Jacobsen, and my 2007 SQL PASS Summit presentation on designing, building, and maintaining large Analysis Services cubes
SQLCAT: A Preview to PowerPivot Server Best Practices (Denny Lee)
The document discusses SQL Server Customer Advisory Team (SQLCAT) and their work on the largest and most complex SQL Server projects worldwide. It also discusses SQLCAT's sharing of technical content and driving of product requirements back into SQL Server based on customer needs. The document promotes an upcoming SQL Server Clinic where experts will be available to answer questions about architecting and designing future applications.
SQLCAT: Tier-1 BI in the World of Big Data (Denny Lee)
This document summarizes a presentation on tier-1 business intelligence (BI) in the world of big data. The presentation will cover Microsoft's BI capabilities at large scales, big data workloads from Yahoo and investment banks, Hadoop and the MapReduce framework, and extracting data out of big data systems into BI tools. It also shares a case study on Yahoo's advertising analytics platform that processes billions of rows daily from terabytes of data.
Jump Start into Apache Spark (Seattle Spark Meetup) (Denny Lee)
Denny Lee, Technology Evangelist with Databricks, will demonstrate how easily many Data Science and Big Data (and many not-so-Big Data) scenarios can be tackled using Apache Spark. This introductory-level jump start will focus on user scenarios; it will be demo-heavy and slide-light!
How Concur uses Big Data to get you to Tableau Conference On Time (Denny Lee)
This is my presentation from Tableau Conference #Data14 as the Cloudera Customer Showcase - How Concur uses Big Data to get you to Tableau Conference On Time. We discuss Hadoop, Hive, Impala, and Spark within the context of Consolidation, Visualization, Insight, and Recommendation.
How Klout is changing the landscape of social media with Hadoop and BI (Denny Lee)
Updated from the Hadoop Summit slides (http://www.slideshare.net/Hadoop_Summit/klout-changing-landscape-of-social-media), we've included additional screenshots to help tell the whole story.
A primer on PowerPivot topology and configurations (Denny Lee)
PowerPivot allows Excel workbooks containing large datasets to be hosted and rendered in SharePoint. It uses an in-memory database called VertiPaq to store and query data. The PowerPivot add-in allows connecting to various data sources from within an Excel workbook. When hosted in SharePoint, Excel Services renders the workbook, while the PowerPivot system service manages data refresh and usage analytics. PowerPivot supports on-premises SharePoint farms with load balancing and high availability of PowerPivot system components across multiple application servers.
Webinar: Designing a schema for a Data Warehouse (Federico Razzoli)
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas: denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
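As a tiny illustration of the star schema idea described above, one fact table surrounded by denormalised dimensions (table and column names invented), using SQLite so the sketch is self-contained:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One fact table at a chosen granularity (one row per order line),
# each foreign key pointing at a denormalised dimension.
db.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);

CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,   -- additive measures live on the fact table
    amount      REAL
);
""")

# A typical analysis: join the fact to its dimensions and aggregate.
for year, category, total in db.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.year, p.category
"""):
    print(year, category, total)
```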
GraphRAG for Life Science to increase LLM accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation for the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
5th LF Energy Power Grid Model Meet-up Slides (DanBrown980551)
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Big Data, Bigger Brains
1. November 6-9, Seattle, WA
Big Data, Bigger Brains
How Klout Changed the Landscape of Social Media with Hadoop and BI
Denny Lee, Microsoft
Dave Mariani, Klout
3. Klout’s Big Data makes all this possible
15 Social Networks Processed Every Day
769 Terabytes of Data Storage
200,000 Indexed Users Added Every Day
140,000,000 Users Indexed Every Day
12,000,000,000 Social Signals Processed Every Day
30,000,000,000 API Calls Delivered Every Month
1,080,000,000,000 Rows of Data In Data Warehouse
6. What is Business Intelligence?
• Data Warehousing, OLAP, Dashboards, Reporting
• Ability to slice and dice data in an ad-hoc manner
• Getting the right data to the right people, at the right time (i.e. now)
7. Why Hadoop + BI?

Requirement                                  Hadoop & Hive   BI Query Engines
Capture & store all data                     Yes             No
Support queries against detail data          Yes             No
Support interactive queries & applications   No              Yes
Support BI & visualization tools             No              Yes
8. An Example: Klout Event Tracker
1 Perform A/B Testing of User Flows
2 Optimize Registration Funnels
3 Monitor consumer engagement & retention (DAUs & MAUs); see the sketch below
4 Flexibly track and report on user generated events
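As a hedged illustration of item 3 (not from the original deck), a daily-active-user count can be expressed directly against the event_log Hive table that appears later in these slides; it assumes ks_uid identifies a unique user and dt partitions the log by day:

-- Sketch: daily active users per project over the event_log table
-- (assumes ks_uid uniquely identifies a user; dt is the day partition)
SELECT dt, project, COUNT(DISTINCT ks_uid) AS dau
FROM bi.event_log
GROUP BY dt, project;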
9. A Flexible, Hierarchical Schema

Project: a collection of Events (e.g. HomePage, Actions, Mobile iOS)
Event: a captured user action (e.g. the +K "Add a topic" event)
Property Type: an attribute key (e.g. Source, Gender, Location)
Property Value: an attribute value (e.g. Google Search, Male, SF)
10. Event Tracker Architecture

Pipeline stages: Instrument → Collect → Persist → Query → Report
• Instrument: Tracker API (Scala, node.JS)
• Collect: Log Process (Flume)
• Persist: Warehouse
• Query: Cube (Analysis Services)
• Report: Klout UI (Scala, AJAX UX)
SELECT { [Measures].[Counter], [Measures].[PreviousPeriodCounter] }
ON COLUMNS,
NON EMPTY CROSSJOIN (
    exists([Date].[Date].[Date].allmembers,
           [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
    [Events].[Event].[Event].allmembers
) DIMENSION PROPERTIES MEMBER_CAPTION
ON ROWS
FROM [ProductInsight]
WHERE ({[Projects].[Project].[plusK]})
event_log
    tstamp       string
    project      string
    event        string
    session_id   bigint
    ks_uid       bigint
    ip           string
    json_keys    array<string>
    json_values  array<string>
    json_text    string
    dt           string
    hr           string
{
    "project": "plusK",
    "event": "spend",
    "session_id": "0",
    "ip": "50.68.47.158",
    "kloutId": "123456",
    "cookie_id": "123456",
    "ref": "http://klout.com/",
    "type": "add_topic",
    "time": "1338366015"
}
will be saved in HDFS at:
/logs/events_tracking/2012-05-30/0100

An incoming tracking call looks like:
insights3:9003/track/{"project":"plusK","event":"spend","ks_uid":123456,"type":"add_topic"}
14. Linked Server Connection

EXEC master.dbo.sp_addlinkedserver
    @server = N'HiveDW', @srvproduct = N'HIVE',
    @provider = N'MSDASQL', @datasrc = N'Hive DW',
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=SSAS;Password=P@assw0rd;';

CREATE VIEW vw_tbl_HiveSample AS
SELECT * FROM OpenQuery(HiveDW, 'SELECT * FROM HiveSampleTable;');
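A hedged usage sketch (not in the original deck): once the view exists, the Hive data can be queried like any local object and even joined to native SQL Server tables. dbo.DimUser and its ks_uid column are hypothetical placeholders for illustration:

-- Query the Hive-backed view as if it were a local table
SELECT TOP 10 * FROM vw_tbl_HiveSample;

-- Hypothetical heterogeneous join against a local SQL Server table
SELECT u.UserName, h.*
FROM vw_tbl_HiveSample AS h
JOIN dbo.DimUser AS u
    ON u.ks_uid = h.ks_uid;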
15. OpenQuery / Linked Server Thoughts
• On paper, this probably shouldn't work
• Yet it works, and it has been running in production for three months now
• It's more stable than the original MySQL connection
• Has the advantage that, from the cube's perspective, the DSV "looks like" it's connecting to a SQL Server
16. Demo: Hadoop to BI Workflow
• Custom BI Application
• Commercial BI Tool
• Command Line
• HiveQL in Excel
17. Hadoop & BI Together: Query Cube using a Custom App
21. HiveQL Example
SELECT
get_json_object(json_text,'$.sid') as sid,
get_json_object(json_text,'$.kloutId') as kloutId,
get_json_object(json_text,'$.v') as version,
get_json_object(json_text,'$.status') as status,
event
FROM bi.event_log
WHERE project='mobile-ios'
AND dt=20121027
AND event in ('api_error', 'api_timeout')
ORDER BY sid;
22. Hadoop & BI Together: Query Hive using Excel
23. Hadoop & BI Together: Digging deeper on influence
25. So let's dig in!

select * from openquery(hiveprodbi,
    'select * from bi_maxwell.actor_action
     where network_abbr = ''gp''
       and ks_uid = 1711
       and actor_ks_uid = 477358')

It returns this data:

ks_uid:              1711
service_uid:         103493459351957813291
service_id:          13
tstamp:              1345073319000
message_id:          NULL
action:              GOOGLE_PLUS_PLUSONES
actor_service_uid:   100525279016049609152
tstamp_type:         ORIGINAL_CONTENT_CREATION
original_message_id: z12djbti1oelu1eff22lztphnqapsd25t04
original_tstamp:     1345073319000
registered_flag:     1
actor_ks_uid:        477358
26. Digging in further…

Took the original_message_id field and looked it up on Google+'s API:
• Go to https://developers.google.com/+/api/latest/activities/get
• Enter the content_id of 'z12djbti1oelu1eff22lztphnqapsd25t04' into the 'activityId' field in the form
• It will return this URL in the payload: https://plus.google.com/+CaliLewis/posts/JxAKBWXy3zZ
28. Why Hadoop + BI?

Requirement                                  Hadoop & Hive   BI Query Engines
Capture & store all data                     Yes             No
Support queries against detail data          Yes             No
Support interactive queries & applications   No              Yes
Support BI & visualization tools             No              Yes
29. Best Practices and Lessons Learned
• Avoid using a traditional database management system for staging purposes
• Use the SQL Server OpenQuery interface for heterogeneous joins
• Leverage Hive user-defined functions (UDFs) to transform complex data types, such as JSON, into rows and columns that Transact-SQL can understand (see the sketch below)
• Manage large dimensions by using Hive views
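As a hedged sketch of the UDF bullet (not from the deck), Hive's built-in json_tuple UDTF can flatten fields out of json_text into columns that OpenQuery, and therefore Transact-SQL, can consume; the table and fields reuse the event_log schema and the slide-21 query shown earlier:

-- Flatten JSON fields into columns with the built-in json_tuple UDTF
SELECT t.sid, t.kloutId, e.event
FROM bi.event_log e
LATERAL VIEW json_tuple(e.json_text, 'sid', 'kloutId') t AS sid, kloutId
WHERE e.project = 'mobile-ios'
  AND e.dt = '20121027';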
30. Best Practices and Lessons Learned
• Make sure Hive UDFs are permanent and visible to the ODBC provider. Options:
  • Adding them to the .hiverc file (HiveCLI only; see the sketch below)
  • Converting the UDF to a built-in function and recompiling the Hive code
  • Updating the Hive function registry to add the UDFs to the built-in function list:
    • modify the FunctionRegistry class to register the UDFs (defined in an .hql file)
    • add all the dependent jars to the hive.aux.jars.path property
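A minimal sketch of the .hiverc route mentioned above (HiveCLI only; the jar path and class name are hypothetical placeholders):

-- Lines added to ~/.hiverc so every HiveCLI session registers the UDF;
-- note this does NOT make the function visible to HiveServer/ODBC clients
ADD JAR /usr/local/hive/aux_jars/klout-udfs.jar;
CREATE TEMPORARY FUNCTION json_explode AS 'com.klout.hive.udf.JsonExplode';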
31. Best Practices and Lessons Learned
• Pad zero-length string data

SELECT
    State =
        CASE
            WHEN state = 'empty' THEN NULL
            ELSE state
        END,
    Country =
        CASE
            WHEN country = 'empty' THEN NULL
            ELSE country
        END
FROM OpenQuery(HiveDW, 'SELECT
    CASE
        WHEN LENGTH(state) = 0 THEN ''empty''
        ELSE COALESCE(state, ''empty'')
    END AS state,
    CASE
        WHEN LENGTH(country) = 0 THEN ''empty''
        ELSE COALESCE(country, ''empty'')
    END AS country
FROM HiveSampleTable')
33. November 6-9, Seattle, WA
Thank you for attending this session and the 2012 PASS Summit in Seattle
Editor's Notes
Analysis Services does not ship with a HiveQL cartridge
In Tabular mode this is possible because of the many similarities between HiveQL and SQL (provided the queries are not too complex)
Business Intelligence Development Studio and SQL Server Data Tools are unable to generate an appropriate DSV for ODBC data sources through MSDASQL
These tools use the .NET Framework Data Provider for OLE DB (System.Data.OleDb), which does not support the OLE DB Provider for ODBC
Copy this from notepad for demo:
CREATE TABLE mobile_ios_details_20120612 as
SELECT
get_json_object(json_text,'$.sid') as sid,
get_json_object(json_text,'$.inc') as inc,
get_json_object(json_text,'$.status') as status,
event
FROM bi.event_log
WHERE project='mobile-ios'
AND dt=20120612
AND get_json_object(json_text,'$.v')<>'1.5'
AND (event = 'api_error' OR event = 'api_timeout')
ORDER BY sid;
I don’t know who she is
I don’t follow her on Twitter / FB / etc
If I dig in I realize that she’s with Geek.tv so it is something I watch but still…
1. Don’t throw data away, leverage Hadoop (track users and events for a/b testing)
2. BI tools aggregate data, but we need to reach back to the detail to answer deeper questions (http codes)
3. Hadoop != interactive queries (combined proprietary data with detail)
4. Use open source, but don’t reinvent the wheel (BI tools are mature, valuable & complementary)
Leverage the best tool for the function or job