Enabling Key Business Advantage from Big Data through Advanced Ingest Processing - StampedeCon 2014
At StampedeCon 2014, Ronald Indeck (VelociData) presented "Enabling Key Business Advantage from Big Data through Advanced Ingest Processing."

All too often, we see critical data dumped into a “data lake”, causing the waters to stagnate and the lake to become a “data swamp”. We have found that many data transformation, quality, and security processes can be addressed a priori, on ingest, to improve the quality and accessibility of the data. Data can still be stored in raw form if desired, but processing on ingest can unlock operational effectiveness and competitive advantage by integrating fresh and historical data, enabling the full potential of the data. We will discuss the underpinnings of stream processing engines, review several relevant business use cases, and discuss future applications.




Enabling Key Business Advantage from Big Data through Advanced Ingest Processing - StampedeCon 2014: Presentation Transcript

  • Enabling Key Business Advantage from Big Data through Advanced Ingest Processing
    Ronald S. Indeck, PhD, President and Founder, VelociData, Inc.
    Solving the Need for Speed in Big DataOps
  • www.velocidata.com info@velocidata.com
    Today’s Discussion
    • Motivations for Advanced Processing
    • Total Data Challenges
    • Economical Parallelism for IT is Arriving
    • Heterogeneous System Architectures (HSA)
    • HSA Implementation and Business Benchmarks
    • Questions
  • Big Data?
  • The Urgency for Gaining Answers in Seconds
    Companies that embrace analytics accelerate performance. “Value Integrators” achieve higher business performance:
    ‒ 20 times the EBITDA growth
    ‒ 50% more revenue growth
    • “Large-scale data gathering and analytics are quickly becoming a new frontier of competitive differentiation” – HBR
    • The challenge for IT is to economically provide real-time, quality data to support business analytics and meet time-bound service-level requirements when data are doubling every 12 months
    Analytics is creating a competitive advantage
  • Recognizing “Total Data” Challenges
    • Bloor: Databases are more than adequate for the use cases they are designed to support
    • Consider Big Data AND Relational, not OR … think “Total Data”
    • The critical unsolved challenge is breaking Total Data flow bottlenecks
    • Total Data challenges:
      • Data volumes exploding
      • Data velocity and variety growing
      • Data must quickly move between disparate systems
      • Processing high volumes on mainframes is expensive
      • No spare resources for critical encryption / masking
      • Improving or measuring data quality is challenging
  • Conventional Approaches
    • Add more cores and memory to the existing platform
    • Push processing into MPP (Teradata, Netezza, …)
    • Change the infrastructure (Oracle Exadata, …)
    • Use distributed platforms (Hadoop, …)
    These require new skills, time, capital, management, support, and risk … and fail to truly solve the Total Data flow problem
  • Parallelism in IT Processing is Compelling
    • Amdahl’s Law
    • High-performance computing history:
      • Systems were expensive
      • Unique tools and training required
      • Scaling performance is often sub-linear
      • Issues with timing and thread synchronization
    HPC has struggled for 40 years to deliver widespread accessibility, mostly due to cost and poor abstraction, development tools, and design environments.
    If we could just deliver accessibility at an affordable cost …
    • Hardware is now becoming inexpensive
    • Application development improvements are still needed to enable productivity
    → Abstract through implementation of streaming as the paradigm
  • Complementary Approach: Heterogeneous System Architecture
    • Leverage a variety of compute resources
      • Not just parallel threads on identical resources
    • Right resources at the right times
      • Functional elements use appropriate processing components where needed
    • Accommodate stream processing
      • Source → processing → target
      • Streaming data model enables pipelining and data-flow acceleration
    • Embrace fine-grained pipeline / functional parallelism
      • Especially data / direct parallelism
    • Separate latency and throughput
    • Engineered system
      • Manage thread, memory, and resource timing and contention
  • Heterogeneous System Architecture
    • Standard CPUs: general purpose, “not bad at everything”; good branch prediction, fast access to large memory
    • Graphics boards (GPUs): thousands of cores performing very specific tasks; excellent matrix and floating-point performance
    • FPGA coprocessors: fully customizable with extreme opportunities for parallelism; excel at bit manipulation for regex, cryptography, searching, …
  • Example: Risk Modeling Application
    • Compute “value at risk” for a portfolio of 1024 stocks
    • Evaluate using Monte Carlo simulation (Brownian-motion random walk)
    • Execute 1 million trials and aggregate results: 1 trial equals 1024 random walks
    • Double-precision computation
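The benchmark above can be sketched in NumPy. The stock count and trial structure come from the slide; the drift, volatility, horizon, portfolio weighting, and confidence level are assumptions added for illustration, and the trial count is scaled down from the slide's 1 million so the sketch runs quickly:

```python
import numpy as np

def simulate_terminal_prices(n_stocks=1024, n_trials=256, n_steps=16,
                             s0=100.0, mu=0.05, sigma=0.2,
                             dt=1.0 / 252, seed=0):
    """Geometric Brownian-motion random walks: one walk per stock per trial.

    The slide runs 1 million trials of 1024 walks each; the defaults here
    are scaled down so the sketch runs in seconds on a CPU.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_trials, n_stocks, n_steps))
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    # Only the terminal price matters for VaR, so sum the log returns.
    return s0 * np.exp(log_returns.sum(axis=2))

def value_at_risk(confidence=0.95, **kwargs):
    """Aggregate each trial into a portfolio value, then take the loss quantile."""
    terminal = simulate_terminal_prices(**kwargs)
    portfolio = terminal.sum(axis=1)          # equal-weight portfolio per trial
    losses = portfolio.mean() - portfolio     # loss relative to the mean outcome
    return float(np.quantile(losses, confidence))

print(f"95% VaR estimate: {value_at_risk():,.2f}")
```

The inner `standard_normal` / `exp` work is exactly the matrix and floating-point load the GPU stage of the HSA system excels at, while aggregation remains a natural CPU task.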
  • Example: Risk Modeling Performance Results
    • Baseline [CPU-only]: 450 thousand walks/second → 37 minutes to execute 1 billion walks
    • FPGA + GPU + CPU: 140 million walks/second → 6 seconds for 1 billion walks
    • Speedup of 370x
    • Other financial MC simulations are similar
    *First use of GPU, FPGA, and CPU in one application
    [Diagram: application stages 1–3 mapped to an FPGA, a graphics engine, and a chip multiprocessor]
  • Stream Processing as an HSA Appliance
    • Bundles software, firmware, and hardware into an appliance
    • Delivers the right compute resource (CPU, GPU, and FPGA) to the right process at the right time
    • Uses other system resources effectively
    • High-level abstraction: no need to code, re-train, or acquire new skillsets
    • Promotes stream processing for real-time action
      • Sources → processing → targets
      • Streaming data model enables pipelining for data-flow acceleration
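The sources → processing → targets model the slide describes can be sketched with Python generators; the stage names and the pipe-delimited record format are illustrative, not VelociData's API:

```python
def source(lines):
    """Source: yield raw records one at a time instead of materializing a batch."""
    for line in lines:
        yield line.rstrip("\n")

def parse(records):
    """Processing stage: split delimited records into fields."""
    for rec in records:
        yield rec.split("|")

def validate(rows):
    """Processing stage: drop rows that fail a simple field-count check."""
    for row in rows:
        if len(row) == 3:
            yield row

def target(rows):
    """Target: here just collect; in practice, write to HDFS, a DB, etc."""
    return list(rows)

# Stages compose into a pipeline; each record flows through end to end,
# so downstream stages start before upstream input is exhausted (pipelining).
raw = ["1|alice|NY", "2|bob", "3|carol|MO"]
result = target(validate(parse(source(raw))))
print(result)   # the malformed "2|bob" record is filtered out
```

Because every stage consumes and yields one record at a time, throughput and latency can be tuned independently, which is the separation the HSA slide calls for.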
  • Example: VelociData Solution Palette (throughputs in records/second)

    Data Transformation
    • Lookup and Replace: data enrichment by populating fields from a master file, dictionary translations, etc. (e.g. CP → Cardiopulmonologist). Conventional 3000–6000; VelociData 600,000
    • Type Conversions: XML → fixed; binary → char; date/time formats. Conventional 1000–2000; VelociData 800,000
    • Format Conversions: rearrange, add, drop, merge, split, and resize fields to change layouts. Conventional 1000–10,000; VelociData 650,000
    • Key Generation: hash multiple field values into a unique key (e.g. SHA-2). Conventional 3000–20,000; VelociData > 1,000,000
    • Data Masking: obfuscate data for non-production uses; persistent or dynamic; format preserving; AES-256. Conventional 500–10,000; VelociData > 1,000,000

    Data Quality
    • USPS Address Processing: standardization, verification, and cleansing (CASS certification in process). Conventional 600–2000; VelociData 400,000
    • Domain Set Validation: validate a value against a list of acceptable values (e.g., all product codes at a retailer; all countries in the world). Conventional 1000–3000; VelociData 750,000
    • Field Content Validation: validate based on patterns such as emails, dates, and phone numbers. Conventional 1000–3000; VelociData > 1,000,000
    • Field Content Validation: data type validation and bounds checking. Conventional 3000–6000; VelociData > 1,000,000

    Data Platform Conversion
    • Mainframe Data Conversion: copybook parsing and data layout discovery; EBCDIC, COMP, COMP-3, … → ASCII, integer, float, … Conventional 200–800; VelociData > 200,000

    Data Sort
    • Accelerated Data Sort: sort data using complex sort keys from multiple fields within records. Conventional 7000–20,000; VelociData 1,000,000

    Results are system dependent; figures are intended to provide magnitude comparison.
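The Key Generation entry (hashing multiple field values into a unique key with SHA-2) can be sketched in a few lines of Python; the field names, separator choice, and SHA-256 variant are assumptions for illustration:

```python
import hashlib

def make_key(record, key_fields, sep="\x1f"):
    """Hash selected field values into one deterministic SHA-256 key.

    A non-printing separator avoids collisions such as
    ("ab", "c") vs. ("a", "bc") concatenating to the same string.
    """
    material = sep.join(str(record[f]) for f in key_fields)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Hypothetical record; any change to a key field changes the key.
row = {"cust_id": "10042", "dob": "1970-01-01", "zip": "63130"}
key = make_key(row, ["cust_id", "dob", "zip"])
print(key[:16], "...")   # stable 256-bit key, shown truncated
```

The same pattern, hash-per-record with no cross-record state, is what makes this operation a natural fit for the streaming, per-record parallelism the appliance exploits.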
  • Example of Common ETL Bottlenecks
    [Diagram: Extract sources (CSV, mainframe, XML, RDBMS, social media, sensor, Hadoop) flow through an ETL server and staging DB running Tasks #1–#8, the candidates for acceleration, then Load into targets (Hadoop, ETL server, data warehouse, database appliances, BI tools, cloud)]
  • Example ETL Processes Offloaded
    [Diagram: Tasks #1–#5 are offloaded for acceleration, leaving Tasks #6–#8 on the ETL server and staging DB; same sources and targets as the previous slide]
    • Keep existing input interfaces
    • Remove bottlenecks
    • Reduce ETL server workload
    • Faster total processing time
  • Example Mainframe-to-Hadoop Workflow
    • Simple, configuration-driven workflow
    • Sample shows Mainframe → HDFS: mainframe input → validation → key generation → formatter → lookup → address standardization → CSV out
    • Data are validated, cleansed, reformatted, enriched, …, along the way
    • Enables landing analytics-ready data as fast as it can move across the wire
    • Workflow can also work in reverse to return processed data to the mainframe
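A configuration-driven workflow of this kind could be wired from an ordered stage list, as in this minimal sketch; the stage implementations, field names, and lookup table are trivial stand-ins, not the product's actual configuration format:

```python
import csv
import io

def validation(rows):
    """Drop records missing a required field."""
    return (r for r in rows if r.get("id"))

LOOKUP_TABLE = {"CP": "Cardiopulmonologist"}  # e.g. loaded from a master file

def lookup(rows):
    """Enrich records via dictionary translation (the CP example from earlier)."""
    for r in rows:
        r["specialty"] = LOOKUP_TABLE.get(r["specialty"], r["specialty"])
        yield r

# The workflow itself is just configuration: an ordered list of stage names.
STAGES = {"validation": validation, "lookup": lookup}
WORKFLOW = ["validation", "lookup"]

def run(rows, workflow=WORKFLOW):
    """Chain the configured stages, then land the result as CSV."""
    for name in workflow:
        rows = STAGES[name](rows)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "specialty"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

data = [{"id": "1", "specialty": "CP"}, {"id": "", "specialty": "GP"}]
print(run(data))
```

Reordering or extending `WORKFLOW` changes the pipeline without touching stage code, which is what "configuration-driven" buys over hand-wired ETL scripts.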
  • Wire-rate Platform Integration: Enable Fast Data Access Between Systems
    • MPP platforms (e.g., Teradata): format and improve data for ready insertion into data analytics architectures; VelociData enables real-time data access by Teradata for operational analytics
    • ETL server: preprocess data for fast movement into and out of data integration tools
    • Mainframe: conversion into and out of EBCDIC and packed-decimal formats
    • Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed, quality data for real-time BI efforts
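The packed-decimal (COMP-3) conversion mentioned above works nibble-by-nibble: each byte holds two BCD digits, and the final nibble is a sign. A minimal decoder, assuming well-formed input:

```python
def unpack_comp3(data, scale=0):
    """Decode IBM packed-decimal (COMP-3) bytes into a Python number.

    Each byte carries two 4-bit BCD digits; the last nibble is the sign
    (0xD = negative, 0xC or 0xF = positive/unsigned). `scale` is the
    number of implied decimal places from the copybook PIC clause.
    """
    digits = []
    for byte in data:
        digits.append(byte >> 4)      # high nibble
        digits.append(byte & 0x0F)    # low nibble
    sign_nibble = digits.pop()        # last nibble is the sign, not a digit
    value = 0
    for d in digits:
        value = value * 10 + d
    if sign_nibble == 0x0D:
        value = -value
    return value / 10**scale if scale else value

# 0x12 0x34 0x5C packs the digits 1-2-3-4-5 with a positive sign nibble.
print(unpack_comp3(b"\x12\x34\x5C"))           # 12345
print(unpack_comp3(b"\x12\x34\x5D", scale=2))  # -123.45
```

This per-record bit manipulation with no floating point is exactly the kind of work the earlier HSA slide assigns to the FPGA.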
  • Enabling Three Layers of Data Access
    Wire-rate transformations and convergence of fresh and historical data
    • Sources: sensors, weblogs, transactions, mainframe, Hadoop, social media, RDBMS, …
    • VelociData delivers Hadoop pre-processed, quality data to keep “the lake” clean
    • VelociData enables real-time data access for immediate analytics and visualization
    • VelociData feeds databases and warehouses pre-analytic, aggregated data for operational analytics
  • Accessing Real-time and Historical Data
    • Real-time analysis for competitive advantage: enabling the speed of business to match business opportunities
    • Integrating historical data for operational excellence: informing traditional BI with real-time inputs
    [Diagram: conventional batch-oriented BI and real-time operational analytics feed iterative modeling, leading to business excellence]
  • Stream Processing AND Hadoop
    Leveraging stream processing with batch-oriented Hadoop:
    • Access to more data for analytics
    • Process data on ingest (also land raw data if desired): transformation, cleansing, security
    • Never read a COBOL copybook again
    • Stream sort for integrating data, aggregation, and dedupe
    • …
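The stream-sort-with-dedupe bullet can be sketched as a k-way merge of pre-sorted runs that drops duplicate keys on the fly; this uses the stdlib's `heapq.merge`, and the record shape and keys are illustrative:

```python
import heapq
from itertools import groupby

def merge_dedupe(*sorted_runs, key=lambda r: r[0]):
    """K-way merge of key-sorted record streams, keeping one record per key.

    Runs are consumed lazily, so arbitrarily large inputs stream through
    in constant memory per run: the essence of a stream sort/merge.
    """
    merged = heapq.merge(*sorted_runs, key=key)
    # Merging puts equal keys adjacent, so groupby can dedupe in one pass.
    for _, group in groupby(merged, key=key):
        yield next(group)    # first record wins; later duplicates are dropped

# Fresh data is listed first, so on a key collision the fresh record wins
# (heapq.merge preserves input order among equal keys).
fresh = [(1, "alice-new"), (4, "dave")]
historical = [(1, "alice-old"), (2, "bob"), (3, "carol")]
print(list(merge_dedupe(fresh, historical)))
# [(1, 'alice-new'), (2, 'bob'), (3, 'carol'), (4, 'dave')]
```

Listing the fresh stream before the historical one is the design choice here: it makes the merge itself perform the fresh-over-historical integration the slides describe.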
  • Examples of Data Challenges Being Solved
    • Pharmaceutical discovery query is reduced from 8 days to 20 minutes
    • Retailer now integrates full customer data from in-store, online, and mobile sources in real time (processing 50,000 records/s, up from 100/s)
    • Property-casualty company shortens by five-fold a daily task of processing 540 million records to enable more accurate real-time quoting
    • Credit card company reduces mainframe costs and improves analytics performance by integrating historical and fresh data into Hadoop at line rates
    • Financial processing network masks 5 million fields/s of production data to sell opportunity information to retailers
    • To enable better customer support, a health benefits provider shortens a data integration process from 16 hours to 45 seconds
    • Billions of records with multi-field keys are sorted at nearly a million records/s for analytics and data quality
    • USPS address standardization at 10 billion/hour for data cleansing on ingest
  • Thank You!
  • Questions?