The Concept of Big Data in Different Environments (Big Data) by Sybase
An extremely broad ... from Sybase on the concept known as big data (Big Data).


Transcript

  • 1. BIG DATA ANALYTICS IN A HETEROGENEOUS WORLD. Joydeep Das, Director, Analytics Product Management, Sybase Inc., an SAP Company. February 16, 2012.
  • 2. AGENDA: The real world means business; Change is afoot: myriad solution trends; Building bridges across a heterogeneous world; Summary.
  • 3. BIG DATA ANALYTICS: REAL WORLD MEANS BUSINESS
  • 4. BIG DATA ANALYTICS ISSUES: DEALING WITH VOLUME, VARIETY, VELOCITY, COSTS, SKILLS. Volume: managing and harnessing massive data sets. Variety: harmonizing silos of structured and unstructured data. Velocity: keeping up with unpredictable data and query flows. Costs: too expensive to acquire, operate, and expand. Skills: lack of adequate skills for popular APIs.
  • 5. BIG DATA ANALYTICS MATURITY: FROM JARGON TO TRANSFORMATIONAL BUSINESS VALUE*. Big data technologies (column store, Hadoop, NoSQL, in-memory, MPP) climb a business-value curve from operational efficiencies, to revenue growth, to new strategies and business models. *A McKinsey study titled "Big Data: Next frontier for innovation, competition, and productivity" (May 2011) found huge potential for Big Data Analytics, with metrics as impressive as 60% improvements in retail operating margins, 8% reduction in (US) national healthcare expenditures, and $150M savings in operational efficiencies in European economies.
  • 6. BIG DATA ANALYTICS IN THE REAL WORLD: PREVALENT IN DATA-INTENSIVE VERTICALS AND FUNCTIONAL AREAS. Verticals: Banking, Telecom, Global Capital Markets, Retail, Government, Healthcare, Information Providers. Functional areas: Marketing Analytics (digital channels: track visits, discover the best channel mix across email, social media, and search); Sales Analytics (deep correlations: predict risks based on deal DNA, pattern-matching emails and meetings); Operational Analytics (atomic machine data: analyze RFIDs, weblogs, SMS, and sensors to continuously spot operational inefficiencies); Financial Analytics (detailed simulations: liquidity and portfolio simulations, stress tests, error margins).
  • 7. CHANGE IS AFOOT: MYRIAD OF BIG DATA ANALYTICS SOLUTIONS
  • 8. CAUSAL LINKS: VARIETY, VELOCITY, VOLUME. Variety (events data, transactional data, multi-media data, eCommerce data, graph data) drives velocity (microsecond latencies, continuous flows and/or bursts), which in turn drives volume (routinely petabytes).
  • 9. GROWING USER COMMUNITIES: data scientists, business analysts, developers/programmers, administrators, business users, external consumers, and business processes.
  • 10. HARDWARE IS SUPERIOR. Small server farms scale out; larger servers with partitions scale up. Moving from spinning disks to SSDs gives a 1.2x to 2x speed-up, from SSDs to main memory a 4x to 200x speed-up, and from main memory to CPU caches a 2x to 6x speed-up.
  • 11. SOFTWARE EXPECTATIONS HAVE CHANGED, from traditional to contemporary, along two dimensions: execution characteristics (performance and scalability, intelligence and automation) and results characteristics.
  • 12. EXECUTION CHARACTERISTICS: PERFORMANCE FOCUS. Columnar access; MPP (shared nothing, shared everything); algorithms and computations close to the data, such as in-database analytics (MapReduce) and FPGA filtering; in-memory processing.
  • 13. EXECUTION CHARACTERISTICS: SCALABILITY FOCUS. Data compression (natural compression, hybrid columnar compression, and other compression techniques); column store and row store databases; storage options (SAN, DAS, NAS, distributed file systems); stream processing engines; data filtering and pre-processing engines; transformation engines.
  • 14. EXECUTION CHARACTERISTICS: INTELLIGENCE FOCUS. Query and load optimization; on-demand systems (virtualization and provisioning); CPU cache-conscious computations.
  • 15. EXECUTION CHARACTERISTICS: AUTOMATION FOCUS. Data-conscious federation; automatic workload balancer/mixer; user-community-focused collaborative services.
  • 16. RESULTS CHARACTERISTICS: ACCURACY TOLERANCE FOCUS. Traditional (associated with SQL): complex schemas, multiple applications, write on schema, atomic-level locking, consistency guarantees across system losses, declarative APIs, interactive, encapsulates elements of ACID. Contemporary (associated with NoSQL): simple read on schemas, single application, batch oriented, snapshot isolation, eventual consistency guarantees, procedural APIs, encapsulates elements of CAP.
  • 17. BUILDING BIG DATA BRIDGES ACROSS A HETEROGENEOUS WORLD
  • 18. COMPREHENSIVE 3-TIER FRAMEWORK, COMMERCIAL AND/OR OPEN SOURCE. Eco-system: business intelligence tools, data integration tools, DBA tools, packaged apps. Application services: in-database analytics, multi-lingual client APIs, federation, web enabled. Data management: high performance, highly scalable, cloud enabled.
  • 19. RELIABLE DATA MANAGEMENT (full-mesh, high-speed interconnect). Can handle high performance, compression, and batch and ad-hoc analysis; routinely scales to petabyte-class problems and thousands of concurrent jobs. Typical characteristics: massively parallel processing of complex queries; in-memory and on-disk optimizations; elastic resources for user communities; ACID guarantees; data variety; information lifecycle management; user-friendly automation tools; file systems (schema free) and/or DBMS structures (schema specific).
  • 20. DATA MANAGEMENT INFRASTRUCTURE: ROBUST, SCALABLE, HIGH PERFORMANCE. Supports data discovery (data scientists), application modeling (business analysts), reports/dashboards (BI programmers), business decisions (business end users), and infrastructure management (DBAs) over a full-mesh, high-speed interconnect. A dynamic, elastic MPP grid that can grow, shrink, and be provisioned on demand with heavy parallelization; load, prepare, mine, and report in a single workflow; privacy through isolation of resources; collaboration through sharing of results and data via shared resources.
  • 21. VERSATILE APPLICATION SERVICES. Programming APIs: Python, ADO.NET, Perl, PHP, Ruby, Java, C++, Web Services API. In-database analytics plug-ins: SQL, PMML, C++, Java, and more. Comprehensive declarative and procedural APIs; in-database analytics plug-in APIs; in-database web services; query and data federation APIs; multi-lingual client APIs.
  • 22. VERSATILE APPLICATION SERVICES: RICH ALGORITHMS CLOSE TO DATA. In-database + in-process: user DLLs are dynamically loaded as shared libraries into the Sybase IQ process in memory; highest possible performance; incurs security risks, but manageable via privileges; incurs robustness risks, but manageable via multiplex. In-database + out-of-process: user DLLs are loaded into a separate library-access process reached from the Sybase IQ process via RPC calls; lower security risks; lower robustness risks; lower performance than in-process, but better than out of database. Multi-lingual APIs support scalar to scalar, scalar sets to aggregate, scalar sets to dimensional aggregates, scalar sets to multi-attribute (bulk), and multi-attribute (bulk) to multi-attribute (bulk) signatures.
  • 23. VERSATILE APPLICATION SERVICES: NATIVE MAPREDUCE. Example: for stocks in the enterprise software sector, find the maximum relative strength of a stock for a trading day*. Input key/value (k1, v1): key = (30-min interval time, ticker symbol), value = (tick value day 1, tick value day 2); e.g. 9:30 am SAP 51 / 52.4, 9:30 am ORCL 31 / 28.2, 9:30 am TDC 22 / 21.3, 10:00 am SAP 50.9 / 53.1, and so on. The Map function emits, per ticker symbol (k2), a weighted variance per 30-min interval (v2), where weighted variance = (a given stock's variance for that interval) / (average variance across all "N" stocks for that interval); e.g. SAP 9:30 am: +1.4 / (SUM(+1.4, -2.8, -0.7, ...) / "N" stocks). The Reduce function then outputs (v3), per ticker symbol, the maximum absolute weighted variance across all intervals: Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), ...); the formulas are written out below. *Calculate the max variance for the day by comparing each 30-min interval's tick values across two days (the trading day and the day before), weighted by the average variance of all stocks for each 30-min interval.
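The arithmetic behind this example, using the slide's own numbers, can be written compactly: the per-interval variance of a stock is the change in its tick value between the two days, the Map step divides it by the average variance across all N stocks in that interval, and the Reduce step keeps the maximum absolute weighted variance per ticker.

```latex
% v_{s,t}  : variance of stock s in 30-min interval t (day-2 tick value minus day-1 tick value)
% w_{s,t}  : weighted variance emitted by the Map function, keyed by ticker symbol s
% MaxVar_s : value produced by the Reduce function for ticker symbol s
\[
v_{s,t} = p^{\mathrm{day2}}_{s,t} - p^{\mathrm{day1}}_{s,t},
\qquad
w_{s,t} = \frac{v_{s,t}}{\frac{1}{N}\sum_{s'=1}^{N} v_{s',t}},
\qquad
\mathrm{MaxVar}_s = \max_{t}\,\lvert w_{s,t} \rvert
\]
```

For SAP at 9:30 am, for instance, v = 52.4 - 51 = +1.4, which is divided by the average of (+1.4, -2.8, -0.7, ...) over the N stocks in that interval.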
  • 24. VERSATILE APPLICATION SERVICES: NATIVE MAPREDUCE, THE DECLARATIVE WAY. For stocks in the enterprise software sector, find the max relative strength of a stock for a trading day.
    Map function declaration:
        CREATE PROCEDURE MapVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float, a4 float))
        RESULT SET YZ (b1 char, b2 datetime, b3 float)
    Reduce function declaration:
        CREATE PROCEDURE RedMaxVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float))
        RESULT SET YZ (b1 char, b2 float)
    Query:
        SELECT RedMaxVarTPF.TickSymb, RedMaxVarTPF.MaxVar
        FROM RedMaxVarTPF (TABLE (
                 SELECT MapVarTPF.TickSymb, MapVarTPF.30MinIntTime, MapVarTPF.Var
                 FROM MapVarTPF (TABLE (
                          SELECT TickDataTab.TickSymb, TickDataTab.30MinIntTime,
                                 TickDataTab.30MinValDay1, TickDataTab.30MinValDay2
                          FROM TickDataTab)
                      OVER (PARTITION BY TickDataTab.30MinIntTime)))
             OVER (PARTITION BY MapVarTPF.TickSymb))
        ORDER BY RedMaxVarTPF.TickSymb
    Native MapReduce parallel execution workflow: MapVarTPF is partitioned into 15 parallel instances and RedMaxVarTPF into 25 parallel instances across the SAN fabric; the SQL query collates the output on a single node.
    Native MapReduce with unstructured data: native MapReduce can just as easily be applied to unstructured data (e.g. text, multi-media) stored in the DBMS, or to unstructured data brought into the DBMS at execution time from external files.
  • 25. RICH ECO-SYSTEM. From source data to answers: data preparation, then data usage. Eco-system components: DBMS / filesystem, event processing, data federation, business intelligence, data modeling / database design tools. Categories: business intelligence tools, data integration tools, data mining tools, application tools, DBA tools.
  • 26. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE I. Client-side federation: join data from the DBMS and Hadoop at the client application level. Characteristics: a client tool capable of querying both the DBMS and Hadoop; performs better when results from both sources are pre-computed/pre-aggregated. Big Data use cases: ideal for bringing together Big Data Analytics pre-computations from different domains. Example, in telecommunications: the DBMS has aggregated customer loyalty data and Hadoop has aggregated network utilization data; Quest Toad for Cloud can bring data from both sources (DBMS and Hadoop/Hive), linking customer loyalty to network utilization or network faults (e.g. dropped calls), as sketched below.
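Conceptually, the client-side federated join the slide describes could look like the following sketch. The table names loyalty_agg (pre-aggregated in the DBMS) and network_util_agg (pre-aggregated in Hadoop/Hive) are hypothetical, and the mechanics of reaching both sources belong to the client tool (e.g. Quest Toad for Cloud), not to either database.

```sql
-- Sketch of a client-side federated join (hypothetical table names).
-- loyalty_agg lives in the DBMS; network_util_agg is exposed from Hadoop/Hive.
SELECT l.customer_segment,
       l.avg_loyalty_score,
       n.dropped_calls,
       n.avg_utilization
FROM   loyalty_agg      AS l        -- pre-aggregated in the DBMS
JOIN   network_util_agg AS n        -- pre-aggregated in Hadoop/Hive
       ON n.region = l.region
WHERE  n.dropped_calls > 100        -- link loyalty to network faults
ORDER BY n.dropped_calls DESC;
```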
  • 27. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE II. ETL: load Hadoop data into a DBMS column store; extract, transform, and load data from HDFS (the Hadoop Distributed File System) into the DBMS. Characteristics: extract and load subsets of HDFS data, either raw data or the results of Hadoop MR jobs, into the DBMS store; once stored, HDFS data is treated like any other DBMS data: it gets the ACID properties of a DBMS, can be indexed, joined, and parallelized, can be queried in an ad-hoc way, and is visible to BI and other client tools via the DBMS ANSI SQL API. Big Data use cases: ideal for combining subsets of HDFS unstructured data, or summaries of HDFS data, with the DBMS for mid- to long-term use in business reports. Example, in eCommerce: clickstream data from weblogs stored in HDFS, and the outputs of Hadoop MR jobs on that data (to study browsing behavior), are ETL'd into the DBMS; the transactional sales data in the DBMS is then joined with the clickstream data to understand and predict customer browsing-to-buying behavior (see the sketch below). Currently, the bulk data transfer utility SQOOP (built by Cloudera) can be used to provide this ETL capability.
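Once SQOOP has landed the HDFS clickstream data in the DBMS, it behaves like any other table, so the browsing-to-buying analysis reduces to ordinary ANSI SQL. The tables clickstream and sales below are hypothetical stand-ins for the loaded weblog data and the transactional sales schema.

```sql
-- Sketch: correlate browsing behavior (loaded from HDFS via SQOOP) with sales
-- data already in the DBMS. Table and column names are illustrative only.
SELECT   c.landing_page,
         COUNT(DISTINCT c.session_id) AS sessions,
         COUNT(DISTINCT s.order_id)   AS orders,
         SUM(s.order_amount)          AS revenue
FROM     clickstream AS c
LEFT JOIN sales      AS s
       ON s.customer_id = c.customer_id
      AND s.order_date  = c.visit_date
GROUP BY c.landing_page
ORDER BY revenue DESC;
```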
  • 28. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE III. Data federation: join HDFS data with DBMS data on the fly, fetching and joining subsets of HDFS data on demand using SQL queries from the DBMS. Characteristics: scan and fetch specified data subsets from HDFS via a table UDF called as part of a SQL query; the output is joinable with DBMS data; multiple simultaneous UDF calls are possible; sample UDFs are provided in Java and C++; the HDFS data is not stored in the DBMS but fetched into in-memory tables (ACID properties do not apply; for repeated use, put the fetched data into tables); visible to BI and other client tools via the ANSI SQL API only. Big Data use cases: ideal for combining subsets of HDFS data with DBMS data for operational (transient) business reports. Example, in retail: detailed point-of-sale (POS) data is stored in HDFS; the DBMS EDW fetches POS data for specific hot-selling SKUs from HDFS at fixed intervals and combines it with inventory data in the DBMS to predict and prevent inventory "stockouts", as sketched below.
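A query using this bridge might read like the following sketch. The table UDF hdfs_pos_fetch() is a hypothetical name standing in for the sample Java/C++ UDFs the slide mentions; it is assumed to scan a POS data subset in HDFS and return it as a transient in-memory table.

```sql
-- Sketch: join POS detail fetched from HDFS on the fly with DBMS inventory.
-- hdfs_pos_fetch() and its argument are illustrative, not the product API.
SELECT   p.sku,
         SUM(p.units_sold) AS sold_today,
         i.units_on_hand
FROM     hdfs_pos_fetch('/pos/2012-02-16') AS p   -- transient, not stored in the DBMS
JOIN     inventory                         AS i   -- regular DBMS table
       ON i.sku = p.sku
GROUP BY p.sku, i.units_on_hand
HAVING   SUM(p.units_sold) > 0.8 * i.units_on_hand;  -- flag SKUs heading for a stockout
```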
  • 29. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE IV. Query federation: combine the results of Hadoop MR jobs with DBMS data on the fly, initiating and joining the results of Hadoop MR jobs on demand using SQL queries from the DBMS. Characteristics: trigger Hadoop MR jobs and fetch their results via a table UDF called as part of a Sybase IQ SQL query; the output is joinable with Sybase IQ data; no multiple simultaneous UDF calls; sample UDFs are provided in Java only; the HDFS data is not stored in the DBMS but fetched into in-memory tables (ACID properties do not apply; for repeated use, put the fetched data into tables); visible to BI and other client tools via the DBMS ANSI SQL API only. Big Data use cases: ideal for combining the results of Hadoop MR jobs with DBMS data for operational (transient) business reports. Example, in utilities: smart meter and smart grid data can be combined for load monitoring and demand forecasting; smart grid transmission quality data (multi-attribute time series data) stored in HDFS is computed via Hadoop MR jobs triggered from the DBMS and combined with smart meter consumption data stored in the DBMS to analyze demand and workload (see the sketch below).
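The query-federation variant differs mainly in that the table UDF kicks off a Hadoop MR job and waits for its results. In the sketch below, run_transmission_mr() is a hypothetical UDF name for that trigger, and smart_meter_readings is a hypothetical DBMS table of consumption data.

```sql
-- Sketch: trigger a Hadoop MR job over smart-grid transmission data in HDFS
-- and join its results with smart meter data in the DBMS. Names are illustrative.
SELECT   g.feeder_id,
         g.avg_transmission_loss,              -- computed by the MR job
         SUM(m.kwh_consumed) AS total_demand   -- from the DBMS
FROM     run_transmission_mr('2012-02') AS g   -- results fetched into memory, not stored
JOIN     smart_meter_readings           AS m
       ON m.feeder_id = g.feeder_id
GROUP BY g.feeder_id, g.avg_transmission_loss;
```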
  • 30. RICH ECO-SYSTEM: DBMS <-> PREDICTIVE TOOLS BRIDGE. Express complex computations in the industry-standard Predictive Model Markup Language (PMML) and plug the models in close to the data for execution. A PMML pre-processor converts and validates the models, and a universal predictions plug-in exposes them as PMML UDFs inside the database server, callable from SQL applications (see the sketch below).
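In use, a model deployed through such a bridge would surface as an in-database scoring function callable from plain SQL. The function churn_score_pmml() and the customers table below are hypothetical; the real plug-in derives its UDFs from the validated PMML model.

```sql
-- Sketch: score rows in-database with a PMML-derived UDF (illustrative names).
SELECT customer_id, churn_risk
FROM (
    SELECT customer_id,
           churn_score_pmml(tenure_months, avg_monthly_spend, dropped_calls) AS churn_risk
    FROM   customers
) AS scored
WHERE churn_risk > 0.8          -- keep only customers the model flags as high risk
ORDER BY churn_risk DESC;
```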
  • 31. RICH ECO-SYSTEM: FUNDAMENTALS OF STREAMS TECHNOLOGY. Process data without storing it. Input streams: events arrive on input streams. Derived streams and windows: apply continuous query operators to one or more input streams to produce a new stream. Continuous queries create a new "derived" stream or window (SELECT ... FROM one or more input streams/windows WHERE ... GROUP BY ...); a sketch follows below. Windows can have state: retention rules define how many events, or for how long, events are kept; opcodes in events can indicate insert/update/delete and can be applied to the window automatically.
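Filling in the slide's SELECT ... FROM ... WHERE ... GROUP BY skeleton gives a feel for a continuous query. This is a pseudocode sketch, not exact CCL syntax: the stream name TickStream, the derived window name, and the retention clause are all assumptions.

```sql
-- Pseudocode sketch of a continuous query over an input stream.
-- CREATE WINDOW AvgPriceWindow KEEP 10 MINUTES AS   -- retention rule (illustrative)
SELECT   t.symbol,
         AVG(t.price) AS avg_price
FROM     TickStream AS t        -- events arriving on an input stream
WHERE    t.price > 0            -- filter applied as each event arrives
GROUP BY t.symbol;              -- the derived window updates continuously
```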
  • 32. RICH ECO-SYSTEM: STREAMS DATA PROCESSING VS TRADITIONAL DATA PROCESSING. SQL operates on tables, rows, and columns, and is on-demand: the query runs when information is needed. CCL operates on windows over event streams, events, and fields, and is event-driven: the query updates when information arrives.
  • 33. RICH ECO-SYSTEM: STREAMS PRE-PROCESSING. Why store Big Data when you can deal with Small Data? Pre-filter unnecessary data on the fly with streams technologies: the ESP engine raises alerts, takes actions, and sends updates, passing only the filtered data on to memory, disk, Hadoop/HDFS, or the DBMS.
  • 34. SUMMARY
  • 35. 3-LAYER LOGICAL INTEGRATION: STREAM PROCESSING <-> NoSQL <-> DBMS. Eco-system layer: BI tools, DI tools, DBA tools, data mining tools. Application services layer: Web 2.0, Java, C/C++, SQL, federation. Data layer: streaming data (ESP) is ingested and persisted into unstructured data stores (Hadoop, content management) and structured data stores (DBMS). The heterogeneous world will require co-existence and playing nice!
  • 36. Q&A. Learn more: http://www.sybase.com/sybaseiqbigdata Contact: 1-800-SYBASE5 (792.2735)