Data Engineering with Scala and Spark

David Radford / Eric Tome / Rupam Bhattacharjee

55,63 €
VAT included
Available
Publisher: Packt Publishing
Year of publication: 2024
Subject: Database design and theory
ISBN: 9781804612583

Take your data engineering skills to the next level by learning how to use Scala and functional programming to create continuous and scheduled pipelines that ingest, transform, and aggregate data.

Key Features

  • Transform data into a clean and trusted source of information for your organization using Scala
  • Build streaming and batch-processing pipelines with step-by-step explanations
  • Implement and orchestrate your pipelines by following CI/CD best practices and test-driven development (TDD)
  • Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Most data engineers know that performance issues in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.

This book will teach you how to use the Scala programming language on the Spark framework, together with the latest cloud technologies, to build continuous and triggered data pipelines. You'll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments using data engineering best practices, test-driven development, and CI/CD. You'll also get to grips with the DataFrame API, the Dataset API, and the Spark SQL API, and learn how to use them. Data profiling and data quality in Scala are also covered, alongside techniques for orchestrating and performance tuning your end-to-end pipelines to deliver data to your end users.

By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.

What you will learn

  • Set up your development environment to build pipelines in Scala
  • Get to grips with polymorphic functions, type parameterization, and Scala implicits
  • Use Spark DataFrames, Datasets, and Spark SQL with Scala
  • Read and write data to object stores
  • Profile and clean your data using Deequ
  • Performance tune your data pipelines using Scala

Who this book is for

This book is for data engineers who have experience in working with data and want to understand how to transform raw data into a clean, trusted, and valuable source of information for their organization using Scala and the latest cloud technologies.

Table of Contents

  1. Scala Essentials for Data Engineers
  2. Environment Setup
  3. An Introduction to Apache Spark and Its APIs - DataFrame, Dataset, and Spark SQL
  4. Working with Databases
  5. Object Stores and Data Lakes
  6. Understanding Data Transformation
  7. Data Profiling and Data Quality
  8. Test-Driven Development, Code Health, and Maintainability
  9. CI/CD with GitHub
  10. Data Pipeline Orchestration
  11. Performance Tuning
  12. Building Batch Pipelines Using Spark and Scala
  13. Building Streaming Pipelines Using Spark and Scala

Related items

  • Hands-On Machine Learning on Google Cloud Platform
    Alexis Perrier / Giuseppe Ciaburro / Kishore Ayyadevara
    Enhance your understanding of Computer Vision and image processing by developing real-world projects in OpenCV 3. Key Features: Get to grips with the basics of Computer Vision and image processing. This is a step-by-step guide to developing several real-world Computer Vision projects using OpenCV 3. This book takes a special focus on working with Tesseract OCR, a free, open-source libr...
    Available

    67,00 €

  • MLOps with Red Hat OpenShift
    Faisal Masood / Ross Brigoli
    Build and manage MLOps pipelines with this practical guide to using Red Hat OpenShift Data Science, unleashing the power of machine learning workflows. Key Features: Grasp MLOps and the machine learning project lifecycle through concept introductions. Get hands on with provisioning and configuring Red Hat OpenShift Data Science. Explore model training, deployment, and MLOps pipeline buildi...
    Available

    61,48 €

  • Data Labeling in Machine Learning with Python
    Vijaya Kumar Suda
    Take your data preparation, machine learning, and GenAI skills to the next level by learning a range of Python algorithms and tools for data labeling. Key Features: Generate labels for regression in scenarios with limited training data. Apply generative AI and large language models (LLMs) to explore and label text data. Leverage Python libraries for image, video, and audio data analysi...
    Available

    83,55 €

  • Machine Learning Infrastructure and Best Practices for Software Engineers
    Miroslaw Staron
    Efficiently transform your initial designs into big systems by learning the foundations of infrastructure, algorithms, and ethical considerations for modern software products. Key Features: Learn how to scale up your machine learning software to a professional level. Secure the quality of your machine learning pipeline at runtime. Apply your knowledge to natural languages, programming ...
    Available

    62,67 €

  • Database Design and Modeling with Google Cloud
    Abirami Sukumaran
    Build faster, more efficient real-world applications on the cloud with a database model that fits your needs. Key Features: Familiarize yourself with the business and technical considerations involved in modeling the right database. Take your data to applications, analytics, and AI with real-world examples. Learn how to code, build, and deploy end-to-end solutions with exper...
    Available

    48,37 €

  • Data Stewardship in Action
    Pui Shing Lee
    Take your organization's data maturity to the next level by operationalizing data governance. Key Features: Develop the mindset and skills essential for successful data stewardship. Apply practical advice and industry best practices, spanning data governance, quality management, and compliance, to enhance data stewardship. Follow a step-by-step program to develop a data operating model...
    Available

    68,38 €