Fast parallel data loading with the bulk API

Can you load 20 million records into Salesforce in under an hour? If not, this webinar is for you.

Published in: Technology

  • STEVE

    Intros, Steve, Sean, team, connect with us on Twitter
  • STEVE

    Connect with Force.com on social networks, keep informed
  • STEVE

    Move toward the right with better knowledge about data loading
  • STEVE

  • STEVE

  • STEVE

  • STEVE

  • STEVE

    The number of shovels working at the same time on the same job.
  • STEVE

    Optimal vs. suboptimal?
    Serial is the baseline: the amount of work done with one shovel
    Optimal is when the time saved is directly proportional to the DOP
    Same amount of work, just less time to complete it
    DOP = 4, so the job takes 1/4 of the time
  • STEVE

    Suboptimal looks very different
    DOP and time saved are not proportional
    In fact, here we perform more work and save no time
    Why?
  • STEVE

    Throughput inhibitors, like locks, exceptions, etc.
    Like your friends all working too closely in the ditch banging each other with the shovels
  • Case study talk track
    ~~~~~~~~~~~~~

    The people who did this work are in Customer Centric Engineering and come across this type of issue regularly.
    Customers have issues with parallelism and hit their throughput inhibitors.
  • The Bulk API is used by Salesforce itself, and you can also use it to build custom data-load apps for large users with specific requirements.

    We will be showing code snippets using Java and the Web Service Connector for Salesforce. (just search for Salesforce WSC)
  • Load Orders – 1 million records : related to Account and Catalogue [500 accounts and 50 catalogues]

    Notice lookup relationships – we’ll focus on the significance of this later on
  • Recurring theme today: RIP
    For data loads that RIP, you need to learn how to Realize, Investigate, and Plan your data loads
  • STEVE

    To teach, we’ll use some Case Studies
    Sean built these from some of his work with customers, right Sean?

    SEAN

    Team function
    Enabling content
  • Intro, baseline for our work
  • STEVE

    Explain threads

    Load properties
    One job
    One thread
    Lots of potential threads available, but not being used because we are explicitly going Serial

    Explain multitenancy and the batch queue
  • Take the opportunity to show WSC: Web Services Connector

  • The file we’re processing is limited to a maximum of 10 million bytes and a maximum of 10,000 rows per batch.
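
A minimal sketch of how a load file could be cut into batches that respect both limits quoted above. The function name `split_into_batches` and the choice to repeat the CSV header in every batch are assumptions made for illustration; they are not taken from the webinar's own Java/WSC code.

```python
import csv
import io

MAX_BATCH_ROWS = 10_000       # row limit quoted in the note above
MAX_BATCH_BYTES = 10_000_000  # byte limit quoted in the note above


def split_into_batches(header, rows, max_rows=MAX_BATCH_ROWS, max_bytes=MAX_BATCH_BYTES):
    """Yield CSV strings (header + data rows) that stay within both limits.

    Hypothetical helper: a single row larger than max_bytes still gets
    a batch of its own rather than being dropped.
    """
    def encode(row):
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerow(row)
        return buf.getvalue()

    header_line = encode(header)
    header_size = len(header_line.encode("utf-8"))
    lines, size = [], header_size
    for row in rows:
        line = encode(row)
        line_size = len(line.encode("utf-8"))
        # Close the current batch before it would exceed either limit.
        if lines and (len(lines) >= max_rows or size + line_size > max_bytes):
            yield header_line + "".join(lines)
            lines, size = [], header_size
        lines.append(line)
        size += line_size
    if lines:
        yield header_line + "".join(lines)


demo_rows = [[f"Order {i}"] for i in range(25_000)]
demo_batches = list(split_into_batches(["Name"], demo_rows))
print(len(demo_batches))
```

With 25,000 small rows the row limit is the one that triggers, so the file is cut into three batches; with wide rows the byte limit would trigger first.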
  • STEVE

  • SEAN

    Show results of serial run that was completed prior to demo –> Org Setup
  • STEVE

    Aligns with expectations
    Work comes from internal information we have
    Work/Runtime = DOP -- DEGREE OF PARALLELISM
    Baseline throughput for DOP = 1 is ~20K/m
    To increase throughput, we’ll try increasing the DOP
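
The Work/Runtime = DOP relationship can be checked against the serial load summary slide in the transcript (77 minutes run time, 75 minutes of work, 1 million records). This is only a sketch of the arithmetic; the function names are made up for illustration.

```python
def degree_of_parallelism(work_minutes, runtime_minutes):
    """DOP = work completed / elapsed run time."""
    return work_minutes / runtime_minutes


def throughput_per_minute(records_loaded, runtime_minutes):
    """Records loaded per minute of elapsed run time."""
    return records_loaded / runtime_minutes


# Figures from the serial load summary slide.
serial_dop = degree_of_parallelism(75, 77)
serial_throughput = throughput_per_minute(1_000_000, 77)

# Figures from the first parallel load summary slide (3 h 15 min of work in 17 min).
parallel1_dop = degree_of_parallelism(3 * 60 + 15, 17)

print(round(serial_dop, 2), round(serial_throughput), round(parallel1_dop, 1))
```

The rounded values line up with the slide figures: a DOP of about 0.97 and roughly 13,000 records per minute for the serial run, and a DOP of about 11.5 for the first parallel run.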
  • STEVE

    Taking baseline DOP = 1 and throughput of 20K/m, extrapolate the optimal load line up as DOP increases
    Let’s see how we do with parallel loads
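
The extrapolated "optimal load line" is the baseline throughput multiplied by the DOP. The sketch below uses the ~20K/m baseline from the note above. The `parallel_efficiency` helper is an assumption added for illustration: it compares an observed throughput with the point on that line.

```python
def optimal_throughput(baseline_per_minute, dop):
    """Ideal throughput if every extra thread adds a full baseline's worth."""
    return baseline_per_minute * dop


def parallel_efficiency(actual_per_minute, baseline_per_minute, dop):
    """Fraction of the ideal line actually achieved (hypothetical metric)."""
    return actual_per_minute / optimal_throughput(baseline_per_minute, dop)


BASELINE = 20_000  # records per minute at DOP = 1, from the note above

for dop in (1, 4, 8, 16):
    print(dop, optimal_throughput(BASELINE, dop))

# First parallel run in the transcript: 22,000 records/min at a DOP of 11.5.
print(round(parallel_efficiency(22_000, BASELINE, 11.5), 3))
```

Against this line, the first parallel run in the transcript reaches under a tenth of its ideal throughput, which is what "highly suboptimal" looks like in numbers.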
  • STEVE

  • STEVE

    Same job, use multiple concurrent threads to process batches in parallel
    # threads available at any given time can vary during the run
    Why? Competing jobs in your org. Multitenant, so competing jobs from other orgs as well.

  • Show that the order__c object is empty
    Show the relationship on Account with “don’t allow…”
    Launch job 02
    Show its progress and the retries

  • STEVE

    Locks are going to hurt throughput
    Retry logic creates more work that hurts rather than helps job throughput
  • STEVE

    Failed load
    DOP higher
    More work
    A bad combination: you can shoot yourself in the foot because of competition for available threads
  • STEVE

    DOP greater
    Highly suboptimal
  • STEVE

    What next Sean?

    SEAN
  • STEVE

    First way to solve the problem: eliminate the locks that Sean mentioned

    The Catalogue lookup is left at its default setting, which does no validation
    The Account lookup has “Don’t allow deletion of the lookup record that’s part of a lookup relationship” set, which causes locks, so clear that option and run again

    Talk about the 500 accounts and 1,000,000 order records being randomly split, so 2,000 records is all you need for this to become an issue…

    Explain the screen while loading: the “In progress Batches” count is the number of threads. There is no throttling; we are taking as much as we can.

    We changed the schema to “fix” this.
    Truncate Order__C!

    What if I can’t do that? Discuss soon.
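
A small simulation of why randomly distributed parent records cause contention. It assumes 500 accounts as in the demo schema, but scales the load down to 100,000 orders to keep the run short. The function names and the `ACC…` IDs are invented for illustration, and the simulation only counts shared parents; it does not model Salesforce's actual locking.

```python
import random


def parents_per_batch(parent_ids, batch_size):
    """Return the set of distinct parent IDs referenced by each batch."""
    return [set(parent_ids[i:i + batch_size])
            for i in range(0, len(parent_ids), batch_size)]


def shared_parents(batches):
    """Count parent IDs that appear in more than one batch."""
    seen, shared = set(), set()
    for batch in batches:
        shared |= seen & batch
        seen |= batch
    return len(shared)


rng = random.Random(0)  # fixed seed so the run is repeatable
accounts = [f"ACC{i:03d}" for i in range(500)]           # hypothetical IDs
orders = [rng.choice(accounts) for _ in range(100_000)]  # scaled-down load

random_batches = parents_per_batch(orders, 10_000)
print([len(batch) for batch in random_batches])
print(shared_parents(random_batches))
```

With 10,000 randomly assigned orders per batch, every batch references essentially all 500 accounts, so any two batches running at the same time compete for the same parent records.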
  • STEVE

    Excellent results
    All because of simple config change
    Great throughput and DOP
    No key problems
  • STEVE

    Right where we want to be
  • STEVE

    What if we can’t change our config?

    SEAN

    Eliminating locks through a config change is the way to go, when possible
    Otherwise, consider ordering your data to manage locks
  • Put the config back on the Account lookup, then show sorting the data in Excel [DO NOT SAVE CSV FILE]
    Truncate the object
    Run parallelSORTED (03)
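
The same ordering can be done in code instead of Excel. This sketch sorts rows by their parent ID so that all children of one account land in the same batch. The field name `Account__c` and both helper names are hypothetical, and the data is scaled down to 100,000 orders.

```python
def sort_by_parent(rows, parent_key):
    """Order rows so that records sharing a parent are adjacent."""
    return sorted(rows, key=lambda row: row[parent_key])


def batches_per_parent(rows, parent_key, batch_size):
    """Map each parent ID to the set of batch numbers that contain it."""
    spread = {}
    for index, row in enumerate(rows):
        spread.setdefault(row[parent_key], set()).add(index // batch_size)
    return spread


# Interleaved input: 500 hypothetical accounts, 200 orders each.
rows = [{"Account__c": f"ACC{i % 500:03d}", "Name": f"Order {i}"}
        for i in range(100_000)]

before = batches_per_parent(rows, "Account__c", 10_000)
after = batches_per_parent(sort_by_parent(rows, "Account__c"), "Account__c", 10_000)

print(max(len(b) for b in before.values()))  # batches touching one account, unsorted
print(max(len(b) for b in after.values()))   # batches touching one account, sorted
```

Unsorted, each account is referenced from every batch. Sorted, each account is confined to one batch here, or to at most two neighbouring batches when the counts do not divide evenly.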
  • SEAN

    Talk about checklist and point to wiki article

    Master-detail:
    Avoid having the same master ID used in different batches, which multiplies locks
    Lookup:
    The case seen during the webinar
    Triggers:
    Locks occur when the records read are not the ones being loaded or updated
    Workflow rules:
    Problems with field update rules

    For a few special operations, Salesforce uses organization-wide group membership locks. To avoid lock exceptions when performing the following operations, you must use serial processing for your data load.

    Adding users who are assigned to roles
    Changing users’ roles
    Adding a role to the role hierarchy or a territory to the territory hierarchy
    Changing the structure of the role hierarchy or the territory hierarchy
    Adding or removing members from public or personal groups, roles, territories, or queues
    Changing the owner of an account that has at least one community role or portal role associated with it to a new owner who is assigned to a different role than the original owner

    And finally: trying to handle retries inside the load process is not a good strategy in terms of performance.

  • Also mention SEARCH INDEXING when loading lots of data.

    The degree of parallelism varies because the platform is multitenant
  • STEVE
  • Bottlenecks: extraction AND loading

    Data partitioning/distribution – possibly sort on several columns.

    Replication: full load or delta only
  • SEAN

  • Show the site and emphasize “Force.com Data Management Design”
  • STEVE
  • 10,000 rows per batch – how many batches per hour can we do? For extremely large data volumes, the limit can be raised temporarily – see support

    What causes locks? See the wiki, which has lots of great detail

    Why can’t I use standard tools such as dataloader for data loads? Above 5,000,000 records it is not a good idea

    Updates versus inserts? Performance should be similar to inserts.

  • Transcript

    • 1. Salesforce API Series Fast Parallel Data Loading with the Bulk API July 15, 2014
    • 2. #forcewebinar Speaker Hervé Maleville Platform Specialist - France
    • 3. #forcewebinar Safe Harbor Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non- salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter . 
This document and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
    • 4. #forcewebinar Follow Developer Force for the Latest News: @forcedotcom / #forcewebinar; Developer Force – Force.com Community; +Developer Force – Force.com Community; Developer Force; Developer Force Group
    • 5. How fast can you load data into Salesforce?
    • 6. How many records can you load into Salesforce in 1 hour?
    • 7. #forcewebinar Data load throughput [Chart: records/hour, axis from 0 to 25,000,000; levels: OK, Fast, Faster]
    • 8. Parallel processing
    • 9. #forcewebinar A parallel processing analogy: digging a ditch
    • 10. #forcewebinar Serial processing
    • 11. #forcewebinar Parallel processing
    • 12. The number of processes or threads associated with an operation.
    • 13. #forcewebinar Optimal parallel processing [Diagram: time taken by a serial load of 20M records versus a parallel load of four streams of 5M records each]
    • 14. #forcewebinar Sub-optimal parallel processing [Diagram: time taken by a serial load of 20M records versus a parallel load of four streams of 5M records each]
    • 15. #forcewebinar Throughput inhibitors: locks, exceptions, triggers, relationships, … [Diagram: time taken by a serial load of 20M records versus a parallel load of four streams of 5M records each]
    • 16. #forcewebinar Data load case studies: Get hands on with the Salesforce Bulk API; Contrast serial data loads vs. parallel data loads; Measure degrees of parallelism and throughput; Identify and avoid throughput inhibitors; Achieve maximum throughput
    • 17. Prep work
    • 18. #forcewebinar Salesforce Bulk API: Asynchronous data loading; Optimized for large data sets; REST API; Powers many tools; Use to build custom tools with any programming language (Java, etc.)
    • 19. #forcewebinar Demo schema
    • 20. Bulk API Loads that RIP: Realize, Investigate, and Plan
    • 21. Case Studies
    • 22. Serial Data Load
    • 23. #forcewebinar Serial load: Expected plan [Diagram: 16 threads over time] One job; 100 batches; 10,000 records/batch; 1M total records
    • 24. #forcewebinar Serial load: Job configuration
    • 25. #forcewebinar Serial load: Batch creation
    • 26. #forcewebinar Serial load: Batch run
    • 27. Demo Serial load
    • 28. #forcewebinar Serial load summary
        Concurrency Mode: Serial
        Records Loaded: 1 million
        Records Failed: 0
        Run Time: 77 minutes
        Work Completed: 75 minutes
        Throughput: 13,000 records per minute
        Degree of Parallelism: 0.97
        Key Problem: Degree of parallelism explicitly limited to ~1.
        Solution: Explore parallel load for increased throughput.
    • 29. #forcewebinar Parallelism vs. Throughput of a Single Job [Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20); series: Serial] Serial Run: low degree of parallelism
    • 30. Parallel data loads
    • 31. #forcewebinar Parallel load: Expected plan [Diagram: 16 threads over time] One job; 100 batches; 10,000 records/batch; 1M total records
    • 32. #forcewebinar Parallel load: Job configuration
    • 33. Demo Parallel 1
    • 34. #forcewebinar Things to watch for: Locks can significantly affect parallel loads (wasted processing capacity, reduced throughput, failures); Retry logic is not all it’s cracked up to be
    • 35. #forcewebinar Parallel load 1 summary
        Concurrency Mode: Parallel
        Records Loaded: 396,600
        Records Failed: 603,400
        Run Time: 17 minutes
        Work Completed: 3 hours 15 minutes
        Throughput: 22,000 records per minute
        Degree of Parallelism: 11.5
        Key Problem: Lock exceptions. Server worked significantly harder but no increase in throughput.
        Solution: Run the load in serial mode or manage locks.
    • 36. #forcewebinar Parallelism vs. throughput of a single job [Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20); series: Serial, Parallel 1] Parallel Run 1: high degree of parallelism; low throughput due to locks
    • 37. #forcewebinar Time to optimize: Let’s make your data load RIP. Realize – Locks inhibit parallelism and throughput; Investigate – What is causing the locks; Plan – Manage the locks
    • 38. Demo Parallel load 2 Eliminate Locks by Modifying Schema
    • 39. #forcewebinar Parallel load: Sample results
        Concurrency Mode: Parallel
        Records Loaded: 1 million
        Records Failed: 0
        Run Time: 3 minutes and 30 seconds
        Work Completed: 1 hour
        Throughput: 320,000 records per minute
        Degree of Parallelism: 19
        Key Problem: None
        Solution: n/a
    • 40. #forcewebinar Parallelism vs. throughput of a single job [Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20); series: Serial, Parallel 1, Parallel 2] Parallel Run 2: high degree of parallelism; high throughput
    • 41. #forcewebinar Locks can be managed by: Elimination; Ordering load file
    • 42. Demo Parallel load 3 Avoid Locks with Ordered Data
    • 43. #forcewebinar Managing locks … a discussion while we load: Master-detail relationships; Lookup relationships; Roll-up summary fields; Triggers; Workflow rules; Group membership locks*
    • 44. #forcewebinar Parallel load: Sample results
        Concurrency Mode: Parallel
        Records Loaded: 1 million
        Records Failed: 0
        Run Time: 4 minutes
        Work Completed: 1 hour
        Throughput: 250,000 records per minute
        Degree of Parallelism: 16.5
        Key Problem: Minimal overhead due to locks
        Solution: Remove all unnecessary locks
    • 45. #forcewebinar Parallelism vs. throughput of a single job [Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20); series: Serial, Parallel 1, Parallel 2, Parallel 3] Parallel Run 3: high degree of parallelism; high throughput
    • 46. Controlled feed/parallel data loads
    • 47. #forcewebinar Controlled feed load methodology: Explicit throttling on parallelism and throughput (parallel extraction and loading; prioritization of asynchronous processing capacity); Manage inhibitors in complex jobs (data skews; multiple locks)
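
One possible reading of "explicit throttling on parallelism" is to cap, on the client side, how many batches are in flight at once. The sketch below does that with a bounded thread pool. `run_controlled_feed` and `FakeLoader` are hypothetical names, and `FakeLoader.submit` merely sleeps in place of a real Bulk API call; the slides do not say how the controlled feed was actually implemented.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_controlled_feed(batches, submit_batch, max_parallel):
    """Submit batches with at most max_parallel of them in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(submit_batch, batches))


class FakeLoader:
    """Stand-in for a real loader; records the peak concurrency it saw."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def submit(self, batch):
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        time.sleep(0.01)  # pretend the batch takes a moment to process
        with self._lock:
            self.in_flight -= 1
        return len(batch)


loader = FakeLoader()
feed_batches = [[1, 2], [3], [4, 5, 6]] * 4
feed_results = run_controlled_feed(feed_batches, loader.submit, max_parallel=3)
print(feed_results, loader.peak)
```

Lowering `max_parallel` trades peak throughput for predictability, which matches the "reduced parallelism, expected throughput" outcome shown for the controlled feed run.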
    • 48. #forcewebinar Parallelism vs. throughput of a single job [Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20); series: Serial, Parallel 1, Parallel 2, Parallel 3, Controlled Feed] Controlled Feed Run: reduced parallelism; expected throughput
    • 49. #forcewebinar Related wiki article and Architect Core Resources
    • 50. #forcewebinar Recap: Make your parallel data loads RIP. Realize – Locks inhibit parallelism and throughput; Investigate – What is causing the locks; Plan – Manage the locks
    • 51. Q & A #forcewebinar Hervé Maleville Platform Specialist - France