Unstructured Data Now Generally Available in Snowflake, Processing with Snowpark in Public Preview

We’re excited to announce the general availability of unstructured data management in Snowflake. We launched the public preview of this functionality in September 2021, and since then customers across industries have adopted it for a variety of use cases. These include storing and securing call center recordings, securely sharing PDF documents in Snowflake Data Marketplace, storing medical images and extracting data from them, and many more.

The partner ecosystem for unstructured data continues to grow. We have ML partners like Clarifai, Impira, Labelbox, Semantic Health, and Veritone that can help customers derive valuable insights from their unstructured data. We also have partners like Hammerspace that can provide additional data management capabilities.

Snowpark for unstructured data now in public preview

Customers who want a single, centralized repository for multiple types of data can not only store and govern unstructured data in Snowflake but also process it, either externally with External Functions or natively with Snowpark and Java, now in public preview.

Snowpark is Snowflake’s new developer framework that natively supports Scala, Java, and Python (in preview) with full control over libraries, eliminating the need for separate processing engines. By enabling teams to collaborate on the same data in a single platform, customers can streamline their architecture and enable a wide variety of new use cases.

“We have a project to apply text analytics on vast amounts of email data with attachments. We were storing email body separately and attachments as binary in the database, which came with challenges. Attachments can exceed column storage limitations and the original email needs to be stored on disk to be accessed again. Snowflake’s support for unstructured data allows us to have all of the data and processing in one place and build rich datasets for machine learning in various use cases. We can store email files in their original format in a Snowflake-managed stage and process them using Snowflake’s engine with Java UDFs.”

 — Eranga, VP Data Science in a leading software company

Processing unstructured data using Snowpark

Using streams, tasks, and directory tables, customers can build continuous data pipelines to process unstructured data. The files themselves can be processed on Snowflake compute using Java functions and Snowpark.

Data engineers, data scientists, and developers can create Java user-defined functions to:

  • Extract text from documents.
  • Process emails, extract metadata, and extract attachments.
  • Process medical images to extract patient information stored in them.
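
As a rough illustration of the email case, here is a minimal sketch of a handler that pulls a few headers out of a raw email file. It assumes the function receives the staged file as a java.io.InputStream and that the file is a plain RFC 822-style message. The class name, method name, and output format are hypothetical, not Snowflake's API, and attachment or MIME handling is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical handler class: names and output format are illustrative only.
class EmailHeaderExtractor {
    // Headers to report, in output order (an arbitrary choice for this sketch).
    private static final String[] WANTED = {"from", "to", "subject", "date"};

    // Reads the header block of the message and returns the wanted headers
    // as "name: value" lines. Header names are lower-cased.
    public static String extractHeaders(InputStream email) {
        Map<String, String> headers = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(email, StandardCharsets.UTF_8))) {
            String current = null;
            String line;
            // The first blank line separates the headers from the body.
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                char first = line.charAt(0);
                if ((first == ' ' || first == '\t') && current != null) {
                    // Folded header: append the continuation to the previous value.
                    headers.merge(current, line.trim(), (a, b) -> a + " " + b);
                    continue;
                }
                int colon = line.indexOf(':');
                if (colon <= 0) {
                    current = null; // skip lines that are not "Name: value"
                    continue;
                }
                current = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(current, line.substring(colon + 1).trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder out = new StringBuilder();
        for (String name : WANTED) {
            String value = headers.get(name);
            if (value != null) {
                out.append(name).append(": ").append(value).append('\n');
            }
        }
        return out.toString();
    }
}
```

Only the header block is read, so large bodies and attachments are never loaded into memory. A fuller version would likely use a mail-parsing library to decode encoded headers and extract attachments.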

Imagine a healthcare provider has doctor notes stored in PDF or image format, and they need to extract fields from these files into structured tables. Within Snowflake, this healthcare provider can create a Java function to extract data from PDFs or image files, which can be called in SQL queries or pipelines for continuous processing with Snowflake’s engine.
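
The field-extraction step in this scenario might look something like the sketch below. It makes several assumptions: the note has already been converted to plain text (reading a real PDF or image would need a third-party library such as Apache PDFBox or an OCR tool, not shown here), each field sits on its own "Label: value" line, and the labels "Patient", "Visit date", and "Diagnosis" are invented for illustration. The class and method names are hypothetical as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical handler class: the field labels below are made up for this sketch.
class NoteFieldExtractor {
    private static final String[] LABELS = {"Patient", "Visit date", "Diagnosis"};

    // Returns one tab-separated row with a value per label, in LABELS order.
    // A label that does not appear in the note yields an empty field.
    public static String extractRow(InputStream note) {
        String text;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(note, StandardCharsets.UTF_8))) {
            text = reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < LABELS.length; i++) {
            // Match "Label: value" on a single line, ignoring case.
            Pattern pattern = Pattern.compile(
                    "(?im)^[ \\t]*" + Pattern.quote(LABELS[i]) + "[ \\t]*:[ \\t]*(.*)$");
            Matcher matcher = pattern.matcher(text);
            if (i > 0) {
                row.append('\t');
            }
            if (matcher.find()) {
                row.append(matcher.group(1).trim());
            }
        }
        return row.toString();
    }
}
```

A function like this returns the same columns for every file, which is what makes it usable from a SQL query or a pipeline that loads a structured table. Real notes are rarely this regular, so production extraction would more likely rely on a document-parsing or ML service.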

Here’s an example of what such an architecture would look like:

Get started

Try out these new capabilities by following along with this quickstart, which walks you step by step through storing, governing, processing, and sharing unstructured data with Snowflake. You can also explore the product documentation here. We’re always looking for ways to improve, so if you have feedback about Snowpark, share it with us in our dedicated discussion forum.
