The document discusses integrating big data analytics and business intelligence (BI). It outlines 6 cornerstones for making Hadoop deployments successful: employing all data, using all analytic assets, providing self-service access, building collaboration, being open and extensible, and populating best-of-breed reporting tools. Big data analytics can optimize businesses by asking any question of large, diverse data sources, while traditional BI focuses on reporting structured data to answer specific questions. The best approach combines the strengths of both.
From Business Intelligence to Big Data - hack/reduce Dec 2014 (Adam Ferrari)
Talk given on Dec. 3, 2014 at MIT, sponsored by Hack/Reduce. This talk looks at the history of Business Intelligence, from first-generation OLAP tools through modern Data Discovery and visualization tools, and asks what we can learn from that evolution as numerous new tools and architectures for analytics emerge in the Big Data era.
A global overview of the new Oracle business intelligence suite 11g by Oracle.
As the world's first specialized partner for the BI Foundation, we would like to share these slides with you.
Finally, you don’t have to choose between aging legacy BI and limited data discovery tools, because Birst is now available on SAP HANA. The combination of Birst’s agile Business Intelligence and the lightning-fast performance of HANA enables you to analyze more data, more quickly than ever before, leading to new insights into how to improve the performance of your organization.
AI and Marketing: Robot-proofing Your Job (Call Sumo)
Artificial Intelligence (AI) provides marketers with deep knowledge of consumers and clients, and delivers the right message to the right person at the right time. Here is more in-depth information on how AI affects marketing.
Big Data Analytics with Hadoop: Customer Stories (Yellowfin)
Why watch?
Looking to analyze your growing data assets to unlock real business benefits today? But are you sick of all the Big Data hype and hoopla?
Watch this on-demand Webinar from Actian and Yellowfin – Big Data Analytics with Hadoop – to discover how we’re making Big Data Analytics fast and easy:
Learn how a telecommunications provider has already transformed its business using Big Data Analytics with Hadoop.
Hold on as we go from data in Hadoop to predictive analytics in just 40 minutes.
Learn how to combine Hadoop with the most advanced Big Data technologies, and the world’s easiest BI solution, to quickly generate real business value from Big Data Analytics.
What will you learn?
Discover how Actian’s market-leading Big Data Analytics technologies, combined with Yellowfin’s consumer-oriented platform for reporting and analytics, make generating value from Big Data Analytics faster and easier than you thought possible.
Join us as we demonstrate how to:
• Connect to, prepare and optimize Big Data in Hadoop for reporting and analytics.
• Perform predictive analytics on streaming Big Data: Learn how to empower all your analytics stakeholders to move from historical reports to predictive analytics and gain a sustainable competitive advantage.
• Communicate insights attained from Big Data: Optimize the value of your Big Data insights by learning how to effectively communicate analytical information to defined user groups and types.
This Webinar is ideal if…
• You want to act on more data and data types in shorter timeframes
• You want to understand the steps involved in achieving Big Data success – both front and back end
• You want to see how market leaders are leveraging Big Data to become data-driven organizations today
Looking to analyze and exploit Big Data assets stored in Hadoop? Then this Webinar is a must.
Discover how to boost your reporting and navigate Sage 300 faster and easier in this presentation. You can watch the full recording here: http://bit.ly/2qf1awF
Big Data LDN 2018: CONNECTING SILOS IN REAL-TIME WITH DATA VIRTUALIZATION (Matt Stubbs)
Date: 14th November 2018
Location: Keynote Theatre
Time: 13:50 - 14:20
Speaker: Becky Smith
Organisation: Denodo
About: How many users inside and outside of your organization access your organization’s data? Dozens? Hundreds is probably more like it, each with their own structure and content requirements as well as different access rights. As a result, many organizations have witnessed the formation of “data delivery mills,” in various shapes and sizes. How does one create order and reliability in this world of chaotic data streams? Quite easily, if it’s done with data virtualization.
According to Gartner, “through 2020, 50% of enterprises will implement some form of data virtualization as one enterprise production option for data integration.” Data virtualization enables organizations to gain data insights from multiple, distributed data sources without the time-consuming processes of data extraction and loading. This allows for faster insights and fact-based decisions, which helps businesses realize value sooner.
Join us to find out more about:
• What data virtualization actually means and how it differs from traditional data integration approaches.
• How you can connect and combine all your data in real-time, without compromising on scalability, security or governance.
• The benefits of data virtualization and its most important use cases.
A few years ago, business leaders adopted a set of techniques and technologies to transform raw data into meaningful information, known as business intelligence (BI). Today, business leaders need a solution that is contextually responsive, which is exactly what Call Sumo provides.
#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marketing Analytics & Insight (MITX)
2014 MITX Data & Analytics Summit
"Leveraging Self-Service Business Intelligence to Drive Marketing Analytics & Insight"
Speaker: Carmen Taglienti (@carmtag), Business Intelligence & Data Management Practice Lead, Slalom Consulting
Advancements in the BI technology ecosystem and the application of these capabilities to marketing analytics have enabled better, faster, and more accurate insight. In addition to the advancements in technology, marketing organizations are looking to embrace analytics and put the tools that support them into the hands of decision makers in a “self-service” way. Typically, organizations adopt analytics (and the supporting technology) across the enterprise according to the principles of "the analytics-driven organization." This session will introduce an Analytics Maturity Model that enables an analytics-driven marketing organization to assess current proficiencies and understand the capabilities required to achieve its desired state of analytics maturity. This discussion will also cover the alignment of technology solutions at the various levels of the Analytics Maturity Model, as well as the drive toward “self-service,” easy-to-use analytics. Finally, the presenter will demonstrate the use of real-time data acquisition and analytics to drive marketing insight.
http://blog.mitx.org/2014-data-summmit/
It covers the basics of analytics, the types of analytics, analytics tools and techniques, and a brief case study demonstrating predictive analytics with the decision tree algorithm from machine learning.
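The deck itself isn't reproduced here, but a minimal scikit-learn sketch of that kind of decision-tree case study might look like the following; the bundled breast-cancer dataset is a stand-in for whatever data the case study actually used.

```python
# A minimal sketch of predictive analytics with a decision tree,
# using scikit-learn's bundled breast-cancer dataset as a stand-in
# for the case study data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Limit depth to keep the tree interpretable and reduce overfitting.
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```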
The Present - the History of Business Intelligence (Phocas Software)
Learn the history of business intelligence in this three part series. In part one, we discussed how business intelligence software used to be (the past). In part two, we discuss business intelligence as it is in the present.
BI isn't big data and big data isn't BI (updated) (Mark Madsen)
Big data is hyped, but isn't hype. There are definite technical, process and business differences in the big data market when compared to BI and data warehousing, but they are often poorly understood or explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they are different. Armed with this knowledge, you will be better able to make design and architecture decisions. The session is sometimes conceptual, sometimes detailed technical explorations of data, processing and technology, but promises to be entertaining regardless of the level.
Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.
The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.
The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho (BICC Thomas More)
7th BI congress of BICC-Thomas More: 3 April 2014
A travel report from Business Intelligence to Big Data
The travel industry is changing rapidly. This presentation is a journey through classic and modern BI destinations, showing a series of snapshots of different use cases in the travel industry. During the session we highlight the capacity and flexibility a BI tool needs to guide you on your journey from classic BI implementations to the modern big data challenges.
Global Business Intelligence (BI) software vendor, Yellowfin, and Actian Corporation, pioneers of the record-breaking analytical database Vectorwise, will host a series of Big Data and BI Best Practices Webinars.
These are the slides from that presentation.
The Big Data & BI Best Practices Webinars and associated slides examine the phenomenal growth in business data and outline strategies for effectively, efficiently and quickly harnessing and exploring ‘Big Data’ for competitive advantage.
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS (Amazon Web Services)
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But which services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
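As a hedged sketch of the "ingest" stage of that data bus, here is what pushing events into an Amazon Kinesis stream with boto3 could look like; the stream name and event fields are hypothetical.

```python
# Sketch of the "ingest" stage of the data bus described above:
# pushing JSON click events into an Amazon Kinesis stream.
# The stream name and event fields are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-1001", "action": "page_view", "ts": 1480000000}
kinesis.put_record(
    StreamName="clickstream",            # assumed, pre-created stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],       # shards by user for ordering
)
```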
A look at the evolution of analytics and its revolutionary potential to transform ordinary businesses, power new business models, enable innovation, and deliver greater value. http://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/analytics-trends.html
Big Data visualization with Apache Spark and Zeppelin (prajods)
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open source tool for data discovery, exploration and visualization. It supports REPLs for the shell, SparkSQL, Spark (Scala), Python and Angular. This presentation was given on Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
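A minimal PySpark sketch of the discovery workflow such a notebook runs follows; in Zeppelin the SQL would live in a %sql paragraph, and the JSON path and field names here are hypothetical.

```python
# Minimal PySpark sketch of the exploration workflow a Zeppelin
# notebook would run; the JSON path and fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("discovery-demo").getOrCreate()

events = spark.read.json("hdfs:///data/events.json")   # assumed input
events.createOrReplaceTempView("events")

# In Zeppelin this SparkSQL query would sit in a %sql paragraph
# and be rendered as a chart automatically.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM events
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```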
http://www.actian.com/
Watch Glen Rabie, CEO of Yellowfin, and Fred Gallahger, GM of Actian Vectorwise take you through 7 of the Best Practices for Big Data and BI.
Time to Fly - Why Predictive Analytics is Going Mainstream (Inside Analysis)
The Briefing Room with Robin Bloor and Perceivant
Live Webcast on Nov. 20, 2012
When companies predict the future effectively, they almost always win. But barriers abound for corporate departments and mid-sized organizations that have limited capital, IT staff, or both. They often lack the resources to employ powerful predictive analytics, and instead can only rely on basic reporting capabilities. That situation is now changing, thanks to several market forces, such as software innovation, maturing methodologies, and competition from open-source offerings.
Check out this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor, who will explain why predictive analytics is finally going mainstream, and what that means for companies looking to grow. He will be briefed by Brian Rowe of Perceivant, who will tout his company’s SaaS-based analytics platform, which was designed to streamline the workflow required to get significant lift from predictive algorithms. He'll also discuss the packaged services designed to help business users get up and running with the key procedures for building and managing predictive models.
http://www.insideanalysis.com
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode... (BigMine)
Talk by Usama Fayyad at BigMine12 at KDD12.
Virtually all organizations are having to deal with Big Data in many contexts: marketing, operations, monitoring, performance, and even financial management. Big Data is characterized not just by its size, but by its Velocity and its Variety, for which keeping up with the data flux, let alone its analysis, is challenging at best and impossible in many cases. In this talk I will cover some of the basics in terms of infrastructure and design considerations for effective and efficient Big Data systems. In many organizations, the lack of consideration of effective infrastructure and data management leads to unnecessarily expensive systems for which the benefits are insufficient to justify the costs. We will refer to example frameworks and clarify the kinds of operations where Map-Reduce (Hadoop and its derivatives) is appropriate and the situations where other infrastructure is needed to perform segmentation, prediction, analysis, and reporting appropriately – these being the fundamental operations in predictive analytics. We will then pay specific attention to on-line data and the unique challenges and opportunities represented there. We cover examples of Predictive Analytics over Big Data with case studies in eCommerce marketing, on-line publishing and recommendation systems, and advertising targeting. Special focus will be placed on the analysis of on-line data with applications in Search, Search Marketing, and the targeting of advertising. We conclude with some technical challenges as well as the solutions that can be applied to these challenges in social network data.
Hot Technologies of 2013 with Robin Bloor, Rick Sherman and IBM
Live Webcast June 19, 2013
http://www.insideanalysis.com
The promise of Hadoop can be seen in all kinds of ways -- the proliferation of open source projects; the virtually limitless applications of Big Data; the sheer number of vendors getting involved. But the real value only comes from a mature environment, and that's Hadoop 2.0. What are the component parts of a robust solution? How are today's cutting-edge organizations leveraging the power of Big Data?
Register for this episode of Hot Technologies to hear veteran Analysts Dr. Robin Bloor of The Bloor Group, and Rick Sherman of Athena IT Solutions, as they offer perspective on how the Hadoop movement is shaping up. Larry Weber of IBM will then offer his take on the tools and architecture necessary to tackle the new challenges posed by Big Data. He'll discuss IBM's latest big data offerings, including IBM InfoSphere BigInsights, IBM InfoSphere Streams, and IBM InfoSphere Data Explorer, as well as IBM's vision for simplifying an organization's big data journey.
New Innovations in Information Management for Big Data - Smarter Business 2013 (IBM Sverige)
Big data has changed the IT landscape. Learn how your existing IIG investment, combined with our latest innovations in integration and governance, is a springboard to success with big data use cases that unlock valuable new insights. Presenter: David Corrigan, Big Data Specialist, IBM
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics (Cynthia Saracco)
Learn how to get started with Big Data using a platform based on Apache Hadoop, Apache Spark, and IBM BigInsights technologies. The emphasis here is on free or low-cost options that require modest technical skills.
Building a Modern Analytic Database with Cloudera 5.8 (Cloudera, Inc.)
Analytic workloads and the ability to determine “what happened” are some of the most common use cases across enterprises today, helping you understand and adapt based on changing trends. However, most businesses today are only able to see a piece of the story. Analytics are limited by the amount of data that can be stored and ultimately accessed, it is time-intensive to bring in new datasets or fit unstructured data into rigid schemas, and user access is constrained to a select few who must already know the questions they’re trying to answer.
It’s no surprise that big data is disrupting this modus operandi for analytics. A modern, Hadoop-based platform is designed to help businesses break free of these analytic limitations, providing a new kind of adaptive, high-performance analytic database. The recent release of Cloudera 5.8 continues to advance Cloudera Enterprise as the foundation for these analytic workloads.
Join Justin Erickson, Senior Director of Product Management at Cloudera, and Andy Frey, Chief Technology Officer at Marketing Associates, as they discuss:
-What technology is needed to build a modern analytic database with Hadoop
-What’s new with Cloudera 5.8
-How to align your teams around agile analytics
-Real world success from Marketing Associates
-What’s next for Cloudera Enterprise’s Analytic Database
An introduction to Big Data, the problems associated with storing and analyzing it, and how Hadoop solves those problems with its HDFS and MapReduce frameworks. Includes a short intro to HDInsight, Hadoop on Windows Azure.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python’s scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
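A small taste of what the hands-on portion might look like, assuming the iris dataset as the running example: one supervised and one unsupervised scikit-learn technique.

```python
# A taste of the workshop's hands-on portion: one supervised and one
# unsupervised scikit-learn technique on the classic iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Supervised: fit a classifier and score it on held-out data.
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: cluster the same features without using the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```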
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
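As a toy sketch (not the Apache Ratis API) of the durability rule such a RAFT-backed log service relies on: an append is acknowledged only once a majority of replicas has persisted it.

```python
# Toy illustration of the RAFT durability rule the Log Service relies
# on: an append is only acknowledged once a majority (quorum) of
# replicas has persisted it. This is not the Apache Ratis API.

class Replica:
    def __init__(self):
        self.log = []
    def persist(self, entry):
        self.log.append(entry)       # stands in for an fsync'd write
        return True

def append_to_log(entry, replicas):
    """Return True if a quorum of replicas durably stored the entry."""
    acks = sum(1 for r in replicas if r.persist(entry))
    return acks >= len(replicas) // 2 + 1

replicas = [Replica() for _ in range(3)]
assert append_to_log(b"wal-edit-1", replicas)  # 2 of 3 acks suffice
```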
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase. (A Python query sketch follows the resource links below.)
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
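A hedged sketch of querying such a Phoenix table from Python through the Phoenix Query Server with the phoenixdb library; the host, table, and column names are hypothetical stand-ins for the Philadelphia crime table described above.

```python
# Sketch of querying a Phoenix table (like the Philadelphia crime
# table above) from Python via the Phoenix Query Server, using the
# phoenixdb library. Host, table, and columns are hypothetical.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()
cursor.execute(
    "SELECT dc_dist, text_general_code, dispatch_date "
    "FROM PHILLY_CRIME "
    "WHERE dispatch_date >= CURRENT_DATE() - 7 "
    "LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
conn.close()
```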
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
While HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Accumulo (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
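The dirlist design itself lives in the Accumulo examples; as a plain-Python sketch of the underlying row-key idea (one plausible scheme, not the actual Accumulo code), depth-prefixed paths make a directory listing a single contiguous range scan in a sorted table.

```python
# Plain-Python sketch of the row-key idea behind a dirlist-style
# table: prefix each path with its zero-padded depth so that all
# children of a directory sort into one contiguous scan range.
# This mimics the design, not Accumulo's actual API.
from bisect import bisect_left, bisect_right

def row_key(path):
    depth = path.rstrip("/").count("/")
    return "%03d%s" % (depth, path)

table = sorted(row_key(p) for p in [
    "/", "/home", "/home/alice", "/home/alice/notes.txt",
    "/home/bob", "/var", "/var/log",
])

def list_dir(path):
    # Children of `path` share its depth + 1 and its path prefix.
    depth = path.rstrip("/").count("/") + 1
    start = "%03d%s" % (depth, path.rstrip("/") + "/")
    end = start + "\xff"
    return table[bisect_left(table, start):bisect_right(table, end)]

print(list_dir("/home"))   # -> keys for /home/alice and /home/bob
```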
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and that annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and are re-written to HDFS instead of being updated, leading to duplicated data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle greater than 500 billion writes a day in our current ingestion systems, and it must be strongly consistent and provide high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, which is critical in allowing us to scale our jobs to greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps scale our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
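Uber's implementation is JVM-side, but a hedged Python sketch of the bookkeeping the Global Indexing component performs, using the happybase client (via the HBase Thrift gateway) against a hypothetical index table, could look like this.

```python
# Hedged sketch of the Global Indexing bookkeeping described above,
# using the happybase client (HBase Thrift gateway). Table, column
# family, and record keys are hypothetical; Uber's implementation
# is not this code.
import happybase

conn = happybase.Connection("hbase-thrift-host")   # assumed gateway
index = conn.table("trips_global_index")           # assumed table

def annotate(record_key, incoming_change):
    """Look up where a record already lives; insert if it is new."""
    row = index.row(record_key)
    location = row.get(b"loc:hdfs_file")
    if location is None:
        location = b"/data/trips/bucket-0007"      # chosen by the writer
        index.put(record_key, {b"loc:hdfs_file": location})
        return ("insert", location, incoming_change)
    return ("update", location, incoming_change)   # rewrite that file

print(annotate(b"trip-8f3a", {"fare": 12.5}))
```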
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
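A hedged sketch of issuing a federated query from Python with the presto-python-client package; the coordinator host, catalogs, schemas, and table names are hypothetical.

```python
# Sketch of a federated Presto query from Python using the
# presto-python-client package. Host, catalog, schema, and table
# names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# A single query can join tables from different connectors,
# e.g. a Hive fact table against a MySQL dimension table.
cur.execute("""
    SELECT o.region, COUNT(*) AS orders
    FROM hive.sales.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```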
Introducing MLflow: An Open Source Platform for the Machine Learning Lifecycle (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects and Models components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
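The "few lines of code" the abstract mentions look roughly like this MLflow Tracking sketch; the placeholder model and metric are assumptions, not taken from the talk.

```python
# Roughly the "few lines" of MLflow Tracking the abstract describes:
# log parameters, a metric, and a deployable model artifact from a
# training script. The model here is a placeholder.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=0).fit(X, y)

    mlflow.log_params(params)                      # hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")       # deployable package
```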
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support Data Analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another datacenter. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep-dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as identifying item stock on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
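A hedged sketch of the object-detection building block behind those storefront scenarios, using an off-the-shelf torchvision detector; the random tensor stands in for a real shelf-camera frame.

```python
# Sketch of the object-detection building block behind the storefront
# scenarios above, using an off-the-shelf torchvision detector on a
# single (placeholder) camera frame.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Stand-in for one RGB frame from a shelf camera (3 x H x W, in [0, 1]).
frame = torch.rand(3, 480, 640)

with torch.no_grad():
    detections = model([frame])[0]   # boxes, labels, scores per image

# Keep confident detections; downstream logic would map labels to
# shelf items and flag gaps or disorganized facings.
keep = detections["scores"] > 0.8
print(detections["boxes"][keep], detections["labels"][keep])
```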
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
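A greatly simplified PySpark sketch of SpaRC's core idea, grouping reads that share k-mers so each cluster can be assembled independently; the real algorithm is far more sophisticated, and the reads here are toy data.

```python
# Greatly simplified sketch of SpaRC's core idea: map each read to its
# k-mers, then group reads that share a k-mer so downstream assembly
# can run per cluster. The real algorithm is far more sophisticated.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparc-sketch").getOrCreate()
sc = spark.sparkContext

K = 5
reads = sc.parallelize([
    ("r1", "ACGTACGTAC"),
    ("r2", "CGTACGTACG"),
    ("r3", "TTTTGGGGCC"),
])

def kmers(read_id, seq):
    return [(seq[i:i + K], read_id) for i in range(len(seq) - K + 1)]

clusters = (reads
    .flatMap(lambda r: kmers(*r))            # (kmer, read_id) pairs
    .groupByKey()                            # reads sharing a k-mer
    .map(lambda kv: tuple(sorted(set(kv[1]))))
    .filter(lambda ids: len(ids) > 1)        # keep multi-read overlaps
    .distinct())

print(clusters.collect())                    # [('r1', 'r2')]
```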
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into what approaches I have already gotten working for real.
Blueprint for Integrating Big Data Analytics and BI
1. Big Data Insight
Blueprint for Integrating Big Data Analytics and BI
Abe Taha, VP Engineering
abetaha@karmasphere.com
www.karmasphere.com
2. Big Data Insight
> Agenda
✓ Where does Big Data Analytics fit in the BI ecosystem
✓ How does Big Data Analytics complement the type of analysis we do today using BI
✓ What are clients doing with Big Data Analytics that they couldn’t do with BI
✓ What do we need to think about to make Hadoop deployments successful
5. Big Data Insight
> The Best of Both Worlds = Big Data Analytics + Traditional BI
                Traditional BI            Big Data Analytics
Purpose         Reporting on business     Optimizing the business
Paradigm        Ask a specific question   Ask any question
Format          Look at structured data   Look at all data
Setup           Pre-engineered            On-the-fly
Data locations  Siloed                    One place
Agility         Weeks to months           Almost immediate
7. Big Data Insight
> What Hadoop Adopters Are Saying
“The kind of new stuff we want to do can’t get done with BI”
– Large Hi Tech Chip Manufacturer
8. Big Data Insight
> How to make Hadoop successful with BI
1. Employ All Data
2. Use All Analytic Assets
3. Provide Self-Service Access for All Users
4. Build a Collaborative Environment
5. Be Open and Extensible
6. Populate Best-of-Breed Reporting Tools
9. Big Data Insight
> Cornerstone 1: Employ All Data
✓ Leave No Data Behind
• Raw unstructured – web logs, machine/sensor data, mobile, social, video, etc.
• Structured data – traditional RDBMS, EDWs
• Streaming vs. batch oriented
• Data governance and quality
10. Big Data Insight
> Cornerstone 2: Use All Analytic Assets
✓ Employ All Analytic Assets
• Traditional models and assets
• Standard Hadoop components, including UDFs and SerDes
• Custom algorithms (see the sketch below)
• Models created in other systems such as SAS/R
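The deck shows no code, but as a hedged illustration of how a custom algorithm can become a Hadoop analytic asset, here is a minimal Python streaming script that Hive's TRANSFORM clause could invoke as a user-defined step; the purchases table and its column layout are assumptions, not from the slides.

```python
#!/usr/bin/env python
# Hedged sketch of a custom analytic step plugged into Hadoop via
# Hive streaming. Hive could invoke it with something like:
#   ADD FILE score.py;
#   SELECT TRANSFORM (user_id, spend)
#     USING 'python score.py' AS (user_id, segment)
#   FROM purchases;
# The purchases table and its two-column, tab-separated layout are
# assumptions for illustration.
import sys

for line in sys.stdin:
    user_id, spend = line.rstrip("\n").split("\t")
    segment = "high_value" if float(spend) > 1000.0 else "standard"
    print("%s\t%s" % (user_id, segment))
```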
11. Big Data Insight
> Cornerstone 3: Provide Self-Service Access for All Users
✓ Self-Service
• BYOD: Bring Your Own Data
• Ingest custom functions and algorithms
• Intuitive, no special skill sets required
✓ Empower All Users and Skill Sets
• Business User
• Easy-to-use ad-hoc analysis, web-based forms
• Drag and drop
• Data Analysts
• Common skills: SQL
• Powerful iterative analysis
• Analytical models and algorithms
• Customers and Partners for ecosystem
12. Big Data Insight
> Cornerstone 4: Build a Collaborative Environment
✓ Collaborative
• Project-based environment
• Leverage cross-functional skills
• Security and isolation
✓ Social
• Share data and insights across teams
• Metadata, Queries, Results and Visualizations
• View colleagues’ activities
• Usage feedback and metrics
13. Big Data Insight
> Cornerstone 5: Be Open and Extensible
✓ Open
• Active community, rapid innovation
• Vendor commitment
• Standards based
• Portable - No vendor lock-in
• Expose standard APIs and interfaces
✓ Extensible
• Add custom functions
• Reuse existing analytic models
• Add additional data sources by defining custom parsers
14. Big Data Insight
> Cornerstone 6: Populate Best-of-Breed Reporting Tools
✓ Best-of-Breed Reporting Tools
• Ingest data from existing BI systems and ad hoc data, including spreadsheet data
• Automate delivery of insights
• Push insights to RDBMS, EDWs and MPP systems
• Expose standard APIs for programmability
15. Big Data Insight
> How would an architecture look
16. Big Data Insight
> Summary
1. Implement Big Data Analytics and BI co-existence: Hadoop at your fingertips
2. Leverage all your assets
3. Use and build on open and extensible solutions across your company…
4. Build in social and collaborative capabilities early
17. Big Data Insight
> Summary: Get the Best of Both Worlds – Build a Bridge Inside Your Company
Big Data Analytics on Hadoop: Future, sees intent; Drives optimization; Just getting started
BI: Historical; Drives reporting; Entrenched; Will be around for a long time